Abstract

We consider online similarity prediction problems over networked data. We begin by relat- 
ing this task to the more standard class prediction problem, showing that, given an arbitrary 
algorithm for class prediction, we can construct an algorithm for similarity prediction with 
"nearly" the same mistake bound, and vice versa. After noticing that this general construction 
is computationally infeasible, we target our study to feasible similarity prediction algorithms on 
networked data. We initially assume that the network structure is known to the learner. Here 
we observe that Matrix Winnow [UJ has a near-optimal mistake guarantee, at the price of cubic 
prediction time per round. This motivates our effort for an efficient implementation of a Percep- 
tron algorithm with a weaker mistake guarantee but with only poly-logarithmic prediction time. 
Our focus then turns to the challenging case of networks whose structure is initially unknown 
to the learner. In this novel setting, where the network structure is only incrementally revealed, 
we obtain a mistake-bounded algorithm with a quadratic prediction time per round. 



1 Introduction 

The study of networked data has spurred a large amount of research efforts. Applications like spam 
detection, product recommendation, link analysis, community detection, are by now well-known 
tasks in Social Network analysis and E-Commerce. In all these tasks, networked data are typically 
viewed as graphs, where vertices carry some kind of relevant information (e.g., user features in 
a social network), and connecting edges reflect a form of semantic similarity between the data 
associated with the incident vertices. Such a similarity ranges from friendship among people in a 
social network to common user's reactions to online ads in a recommender system, from functional 
relationships among proteins in a protein-protein interaction network to connectivity patterns in 
a communication network. Coarsely speaking, similarity prediction aims at inferring the existence
of new pairwise relationships based on known ones. These pairwise constraints, which specify 
whether two objects belong to the same class or not, may arise directly from domain knowledge or 
be available with little human effort. 

There is a wide range of possible means of capturing the structure of a graph in this learn- 
ing context: through combinatorial and classical graph-theoretical methods (e.g., [HI]); through 
spectral approaches (e.g., [31 [22]), using convex duality and resistive geometry (e.g., [M!]), and even
algebraic methods (e.g., [34]). In many of these approaches, the underlying assumption is that the
graph structure is largely known in advance (a kind of "transductive" learning setting), and serves 
as a way to bias the inference process, so as to implement the principle that "connected vertices 
tend to be similar." Yet, this setting is oftentimes unrealistic and/or infeasible. For instance, a 
large online social network with millions of vertices and tens of millions of edges hardly lends itself 
to be processed as a whole via a Laplacian-regularized optimization approach or, even if it does 
(thanks to the computationally powerful tools currently available), it need not be known ahead of 
time. As a striking example, if we are representing a security agency, and at each point in time we 
receive a "trace" of communicating individuals, we still might want to predict whether a given pair 
in the trace belongs to the same "gang"/community, even if the actual network of relationships is
unknown to us. So, in this case, we are incrementally learning similarity patterns among individ- 
uals while, at the same time, exploring the network. Another important scenario of an unknown 
network structure is when the network itself grows with time, hence the prediction algorithms are 
expected to somehow adapt to its temporal evolution. 

Our results. We study online similarity prediction over graphs in two models: one in which the
graph is known a priori to the learner, and one in which it is unknown. In both settings there is an
undisclosed labeling of a graph so that each vertex is the member of one of K classes. Two vertices 
are similar if they are in the same class and dissimilar otherwise. The learner receives an online 
sequence of vertex pairs and similarity feedback. On the receipt of a pair the learner then predicts 
if the pair is similar. The true pair label, similar or dissimilar, is then received and the goal 
of the learner is to minimize mistaken predictions. Our aim in both settings is then to bound the 
number of prediction mistakes over an arbitrary (and adversarially generated) sequence of pairs. 

In the model where the graph is known, we first show via reductions to online vertex classifi- 
cation methods on graphs (e.g., [231 EH [251 EH [29l [28l HH H21 [J3] , and references therein), that 
a suitable adaptation of the Matrix Winnow algorithm [37] readily provides an almost optimal 
mistake bound. This adaptation amounts to sparsifying the underlying graph G via a random 
spanning tree, whose diameter is then shortened by a known rebalancing technique [29], [12]. Unfortunately,
due to its computational burden (cubic time per round), the resulting algorithm does
not provide a satisfactory answer to actual deployment on large networks. Therefore, we develop 
an analogous adaptation of a Matrix Perceptron algorithm that delivers a much more attractive 
answer (thanks to its poly-logarithmic time per round), though with an inferior online prediction 
performance guarantee. 

The unknown model is identical to the known one, except that the learner does not initially 
receive the underlying graph G. Rather, G is incrementally revealed, as now when the learner 
receives a pair it also receives as side information an adversarially generated path within G
connecting the vertices of the pair. Here, we observe that the machinery we used for the known graph
case is inapplicable. Instead, we design and analyze an algorithm which may be interpreted as
a matrix version of an adaptive p-norm Perceptron [20], [17] with the relatively efficient quadratic
running time per round.

1 The reader should keep in mind that while the data at hand may not be natively graphical, it might still be
convenient in practice to artificially generate a graph for similarity prediction, since the graph may encode side
information that is otherwise unexploitable.

Related work. This paper lies at the intersection between online learning on graphs and 
matrix/metric learning. Both fields include a substantial amount of work, so we can hardly do it 
justice here. Below we outline some of the main contributions in matrix/metric learning, with a 
special emphasis on those we believe are most related to this paper. Relevant papers in online class 
prediction on graphs will be recalled in Section 3.

Similarity prediction on graphs can be seen as a special case of matrix learning. Relevant works 
on this subject include [Ml HH QUI E] - see also [21] for recent usage in the context of online 
cut prediction. In all these papers, special care is put into designing appropriate regularization 
terms driving the online optimization problem, the focus typically being on spectral sparseness. 
When operating on graph structures with Laplacian-based regularization, these algorithms achieve 
mistake bounds depending on functions of the cut-size of the labeled graph; see Section 4. Yet, in
the absence of further efforts, their scaling properties make them inappropriate to practical usage 
in large networks. Metric learning is also relevant to this paper. Metric learning is a special case of 
matrix learning where the matrix is positive semi-definite. Relevant references include [45\ \TE\ [391 
H9|[9]. Some of these papers also contain generalization bound arguments. Yet, no specific concerns 
are cast on networked data frameworks. Related to our bidirectional reduction from class prediction 
to similarity prediction is the thread of papers on kernels on pairs (e.g., [2"1 139 1 135 1 [6]), where kernels
over pairs of objects are constructed as a way to measure the "distance" between the two referenced 
pairs. The idea is then to combine with any standard kernel algorithm. The so-called matrix 
completion task (specifically, the recent reference [32]) is also related to our work. In that paper, 
the authors introduce a matrix recovery method working in noisy environments, which incorporates 
both a low-rank and a Laplacian-regularization term. The problem of recovery of low-rank matrices 
has extensively been studied in the recent statistical literature (e.g., [H [181 BH EH EH], and
references therein), the main concern being bounding the recovery error rate, but disregarding the 
computational aspects of the selected estimators. Moreover, the way they typically measure error 
rate is not easily comparable to online mistake bounds. Finally, the literature on semisupervised 
clustering/clustering with side information ([HHE]; see also [43] for a recent reference on spectral
approaches to clustering) is related to this paper, since the similarity feedback can be interpreted 
as a must-link/cannot-link feedback. Nonetheless, their formal statements are fairly different from 
ours. 

To summarize, whereas we are motivationally close to [32], from a technical viewpoint, we are 
perhaps closer to [IH 061 H7J [TQl HH [31], as well as to the literature on online learning on graphs. 

Before delving into the graph-based similarity problem, we start off by investigating the problem 
of similarity prediction in abstract terms, showing that similarity prediction reduces to classification, 
and vice versa. This will pave the way for all later results. 

2 Online class and similarity prediction 

In this section we examine the correspondence in predictive performance (mistake bounds) between 
the classification and similarity prediction frameworks. 

Preliminaries. The set of all finite sequences from a set $\mathcal{X}$ is denoted $\mathcal{X}^*$. We use the Iverson
bracket notation $[\text{predicate}] = 1$ if the predicate is true and $[\text{predicate}] = 0$ if false. In K-class
prediction in the online mistake bound model, an example sequence $(x_1, y_1), \ldots, (x_T, y_T) \in (\mathcal{X} \times \mathcal{Y})^*$
is revealed incrementally, where $\mathcal{X}$ is a set of patterns and $\mathcal{Y} := \{1, \ldots, K\}$ is the set of K class labels.
The goal on the t-th trial is to predict the class $y_t$ given the previous $t - 1$ pattern/label pairs and $x_t$.
The overall aim of an algorithm is to minimize the number of its mistaken predictions. In similarity
prediction, examples are pairs of patterns with "similarity" labels, i.e., $((x', x''), y) \in \mathcal{X}^2 \times \mathcal{Y}_s$ with
$\mathcal{Y}_s = \{0, 1\}$. We interpret $y \in \mathcal{Y}_s$ as similar if $y = 0$ and dissimilar if $y = 1$; we also introduce
the convenient function $\mathrm{sim}(y', y'') := 1 - [y' = y'']$ which maps a pair of class labels $y', y'' \in \mathcal{Y}$ to
a similarity label. A concept is a function $f : \mathcal{X} \to \mathcal{Y}$ that maps patterns to labels. An example
sequence S is consistent with a concept f for classification if $(x, y) \in S$ implies $y = f(x)$ and for
similarity if $((x', x''), y) \in S$ implies $y = \mathrm{sim}(f(x'), f(x''))$. We use $M_A(S)$ to denote the number
of prediction mistakes of the online algorithm A on example sequence S. Given an algorithm A,
we define the mistake bound with respect to a concept f as $\mathbb{B}_A(f) := \max_S M_A(S)$, the maximum
being over all sequences S consistent with f.
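The definitions above can be made concrete with a small sketch. The helper names (`sim`, `is_consistent`, `count_mistakes`) and the toy concept are illustrative assumptions, not part of the paper; the predictor here is non-adaptive, which is enough to show how a mistake count is taken against a consistent similarity sequence.

```python
from typing import Callable, Hashable, Sequence, Tuple

Pair = Tuple[Hashable, Hashable]

def sim(y1: int, y2: int) -> int:
    """Similarity label of two class labels: 0 = similar, 1 = dissimilar."""
    return 0 if y1 == y2 else 1

def is_consistent(seq: Sequence[Tuple[Pair, int]],
                  f: Callable[[Hashable], int]) -> bool:
    """True if every ((x', x''), y) in seq satisfies y = sim(f(x'), f(x''))."""
    return all(y == sim(f(a), f(b)) for (a, b), y in seq)

def count_mistakes(predict: Callable[[Pair], int],
                   seq: Sequence[Tuple[Pair, int]]) -> int:
    """Number of trials on which a fixed (non-adaptive) predictor errs."""
    return sum(predict(pair) != y for pair, y in seq)

# Hypothetical concept with K = 3 classes and a similarity sequence consistent with it.
concept = lambda x: x % 3
seq = [((0, 3), 0), ((1, 2), 1), ((4, 7), 0)]
always_similar = lambda pair: 0
```

A learner that always answers "similar" errs exactly on the dissimilar pairs of the sequence.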

Theorem 1. Given an online classification algorithm $A_c$ one may construct a similarity algorithm
$A_s$ such that if S is any similarity sequence consistent with any concept f then

$$M_{A_s}(S) \le 5\,\mathbb{B}_{A_c}(f)\,\log_2 K, \tag{1}$$

and given an online similarity algorithm $A_s$ one may construct a classification algorithm $A_c$ such
that if S is any classification sequence consistent with any concept f then

$$M_{A_c}(S) \le \mathbb{B}_{A_s}(f) + K. \tag{2}$$



The direct implementation of the similarity algorithm $A_s$ from the classification algorithm $A_c$
is infeasible, as its running time is exponential in the mistake bound. In Appendix A.1 we prove a
more general result (see Lemma 7) than in equation (1), which applies also to noisy sequences and to
"order-dependent" bounds, as in the shifting-expert bounds in [26]. We also argue (Appendix A.1.1)
that the "log K" term in (1) is necessary. Observe that equation (2) implies a lower bound for
similarity prediction if we have a lower bound for the corresponding class prediction problem, with
only a weakening by an additive "−K" term.

3 Class and similarity prediction on graphs 

We now introduce notation specific to the graph setting. Let then G = (V, E) be an undirected and
connected graph with n = |V| vertices, V = {1, ..., n}, and m = |E| edges. The assignment of K
class labels to the vertices of a graph is denoted by a vector $y = (y_1, \ldots, y_n)$, where $y_i \in \{1, \ldots, K\}$
denotes the label of the i-th vertex among the K possible labels. The vertex-labeled graph will
often be denoted by the pairing (G, y). Associated with each pair $(i, j) \in V^2$ of (not necessarily
adjacent) vertices is a similarity label $y_{i,j} \in \{0, 1\}$, where $y_{i,j} = 1$ if and only if $y_i \ne y_j$. As is typical
of graph-based prediction problems (e.g., (23J IMl 1251 E7J [291 EH EH E21 E] , and references therein),
the graph structure plays the role of an inductive bias, where adjacent vertices tend to belong to
the same class. The set of cut-edges in (G, y) is denoted as $\Phi^G(y) := \{(i, j) \in E : y_{i,j} = 1\}$
(when nonambiguous, we abbreviate it to $\Phi^G$), and the associated cut-size as $|\Phi^G(y)|$. The set of
cut-edges with respect to class label k is denoted as $\Phi^G_k(y) := \{(i, j) \in E : k \in \{y_i, y_j\},\ y_{i,j} = 1\}$
(when nonambiguous, we abbreviate it to $\Phi^G_k$). Notice that $\sum_{k=1}^{K} |\Phi^G_k| = 2\,|\Phi^G(y)|$. We let $\Psi$ be
the m × n (oriented and transposed) incidence matrix of G. Specifically, if we let the edges in E be
enumerated as $(i_1, j_1), \ldots, (i_m, j_m)$, and fix arbitrarily an orientation for them (e.g., from the left
endpoint to the right endpoint), then $\Psi$ is the matrix that maps any vector $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$
to the vector $\Psi v \in \mathbb{R}^m$, where $[\Psi v]_\ell = v_{i_\ell} - v_{j_\ell}$, $\ell = 1, \ldots, m$. Moreover, since G is connected, the
null space of $\Psi$ is spanned by the constant vector $\mathbf{1} = (1, \ldots, 1)^T$, that is, $\Psi v = 0$ implies that
$v = c\mathbf{1}$, for some constant c. We denote by $\Psi^+$ the (n × m-dimensional) pseudoinverse of $\Psi$. The
graph Laplacian matrix may be defined as $L := \Psi^T \Psi$, thus notice that $L^+ = \Psi^+ (\Psi^+)^T$. If G is
identified with a resistive network such that each edge is a unit resistor, then the effective resistance
$R^G_{i,j}$ between a pair of vertices $(i, j) \in V^2$ can be defined as $R^G_{i,j} = (e_i - e_j)^T L^+ (e_i - e_j)$, where $e_i$
is the i-th vector in the canonical basis of $\mathbb{R}^n$. When $(i, j) \in E$ then $R^G_{i,j}$ also equals the probability
that a spanning tree of G drawn uniformly at random (from the set of all spanning trees of G)
includes (i, j) as one of its n − 1 edges (e.g., [38]). The resistance diameter of G is $\max_{(i,j) \in V^2} R^G_{i,j}$.
It is known that the effective resistance defines a metric over the vertices of G. Moreover, when G
is actually a tree, then $R^G_{i,j}$ corresponds to the number of edges in the (unique) path from i to j.
Hence, in this case, the resistance diameter of G coincides with its (geodesic) diameter.
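As a rough illustration of the effective resistance, the sketch below computes it exactly with rational arithmetic. Instead of forming the pseudoinverse, it grounds vertex j, injects a unit current at i and solves the reduced Laplacian system, which gives the same value for a connected graph. Function names and the 0-based vertex numbering are assumptions made for this example.

```python
from fractions import Fraction

def laplacian(n, edges):
    """Unweighted graph Laplacian for vertices 0..n-1."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def effective_resistance(n, edges, i, j):
    """Effective resistance between i and j in a connected unit-resistor network."""
    if i == j:
        return Fraction(0)
    L = laplacian(n, edges)
    keep = [v for v in range(n) if v != j]          # ground vertex j
    k = len(keep)
    # Augmented system [L_reduced | e_i]; its solution is the vector of potentials.
    A = [[L[r][c] for c in keep] + [Fraction(1 if r == i else 0)] for r in keep]
    for col in range(k):                            # Gauss-Jordan elimination
        piv = next(r for r in range(col, k) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return A[keep.index(i)][k]                      # potential at i = resistance

path_edges = [(0, 1), (1, 2), (2, 3)]               # a tree: resistance = path length
triangle_edges = [(0, 1), (1, 2), (0, 2)]
```

On the path the resistance between the endpoints equals the number of edges, and on the triangle the resistance across an edge is 2/3, matching the fraction of its three spanning trees that contain that edge.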

3.1 Class prediction on graphs 

Roughly speaking, algorithms and bounds for sequential class prediction on graphs split between 
two types: Those which approximate the original graph with a tree or those that maintain the 
original graph. By approximating the graph with a tree, extremely efficient algorithms are obtained 
with strong optimality guarantees. By exploiting the full graph, algorithms are obtained which 
take advantage of the connectivity to achieve sharp bounds when the graph contains, e.g., dense 
clusters. Relevant literature on this subject includes [271 EH EH EH EHl HH EH [12]. Known 
representatives of the first kind are upper bounds of the form $O(|\Phi^T|(1 + \log\frac{n}{|\Phi^T|}))$ [29] or of the
form $O(|\Phi^T| \log D_T)$ [H], where T is some spanning tree of G, and $D_T$ is the (geodesic) diameter
of T. In particular, if T is drawn uniformly at random, the above turn to bounds on the expected
number of mistakes of the form $O(E[|\Phi^T|] \log n)$, where $E[|\Phi^T|]$ is the resistance-weighted cut-size
of G, $E[|\Phi^T|] = \sum_{(i,j) \in \Phi^G(y)} R^G_{i,j}$, which can be far smaller than $|\Phi^G|$ when G is well connected.
Representatives of the second kind are bounds of the form $O(p + |\Phi^G| R_p)$ [23], [23], where p is the
number of balls in a cover of the vertices of G such that $R_p$ is the maximum over the resistance
diameters of the balls in the cover. Since resistance diameter lower bounds geodesic diameter, this 
alternative approach leverages a different connectivity structure of the graph than the resistance- 
weighted cut-size. 

In all of the above mentioned works, the bounds and algorithms are for the K = 2 class
prediction case. In Appendix A.2 we argue for a simple reduction that will raise a variety of
cut-based algorithms and bounds from the two-class to the K-class case. Specifically, a two-class
mistake bound of the form $M \le c\,|\Phi^G(y)|$ for all $y \in \{0, 1\}^n$, for some $c > 0$, easily turns into a K-class
mistake bound of the form $M \le 2c\,|\Phi^G(y)|$ for all $y \in \{1, \ldots, K\}^n$, where K need not be known in
advance to the algorithm. Therefore, bounds of the form $O(E[|\Phi^T|] \log n)$ also hold in the multiclass
setting.

On the lower bound side, contains an argument showing (for the K = 2 class case) that 
for any $\phi > 0$ a labeling y exists such that any algorithm will make at least $\phi/2$ mistakes while
$E[|\Phi^T|] < \phi$. In short, $\Omega(E[|\Phi^T|])$ is also a lower bound on the number of mistakes in the class
prediction problem on graphs. When combined with Theorem 1 in Section 2, the above results
immediately yield upper and lower bounds for the similarity prediction problem over graphs.

Proposition 1. Let (G, y) be a labeled graph, and T be a random spanning tree of G. Then
an algorithm exists for the similarity prediction problem on G whose expected number of mistakes
E[M] satisfies $E[M] = O(E[|\Phi^T(y)|] \log K \log n)$. Moreover, for any $\phi > 0$ a K-class labeling y
exists such that any similarity prediction algorithm on G will make at least $\phi/2 - K$ mistakes while
$E[|\Phi^T|] < \phi$.

The upper bound above refers to a computationally inefficient algorithm and, clearly enough, 
more direct version space arguments would lead to similar results. Section 4 contains a more
efficient approach to similarity prediction over graphs. 

To close this section, we observe that the upper bounds on class predictions of the form 
$O(E[|\Phi^T|] \log n)$ taken from [29], [12] are essentially relying on linearizing the graph G into a path
graph, and then predicting optimally on it via an efficient Bayes classifier (aka Halving Algorithm,
e.g., [36]). One might wonder whether a similar approach would directly apply to the similarity
prediction problem. We now show that exact computation of the probabilities of the Bayes classifier 
for a path graph is #P-complete under similarity feedback. 

The Ising distribution over graph labelings ($y \in \{1, 2\}^n$) is defined as $p(y) \propto 2^{-\beta |\Phi^G(y)|}$. Given
a set of vertices and associated labels, the marginal distribution at each vertex can be computed
in linear time when the graph is a path ([42]). In [29] this simple fact was exploited to give
an efficient class prediction algorithm by a particular linearization of a graph to a path graph. 
The equivalent problem in similarity prediction requires us to compute marginals given a set of 
pairwise constraints. The following theorem shows that computing the partition function (and 
hence the relevant marginals) of the Ising distribution on a path with pairwise label constraints is 
#P-complete. 

Theorem 2. Computing the partition function of the (ferromagnetic) Ising model on a path graph
with pairwise constraints is #P-complete, where an Instance is an n-vertex path graph P, a set
of pairs $C \subseteq \{1, \ldots, n\}^2$, and a natural number, $\beta$, presented in unary, and the desired Output is
the value of the partition function, $Z_P(C, \beta) := \sum_{y \in \{1,2\}^n :\ y_i = y_j\ \forall (i,j) \in C} 2^{-\beta |\Phi^P(y)|}$.

Thus computing the exact marginal probabilities on even a path graph will be infeasible (given 
the hardness of #P). As an alternative, in the following section we discuss the application of the 
Matrix Perceptron and Matrix Winnow algorithms to similarity prediction. 
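To make the object of the hardness result tangible, here is a brute-force sketch of a constrained partition function on a path. It assumes (this is a reading of the definition, not a quotation) that each labeling is weighted by 2 raised to minus β times its number of cut edges, and that a pairwise constraint forces the two vertices to share a label. The enumeration takes time exponential in n, which is the behaviour the theorem suggests cannot be avoided in general; the function name and 0-based indexing are illustrative.

```python
from fractions import Fraction
from itertools import product

def partition_function(n, constraints, beta):
    """Brute-force Z for the path 0-1-...-(n-1) with labels in {1, 2}.

    Assumed weight: 2^(-beta * cut), where cut counts path edges whose
    endpoints have different labels. Only labelings with y[i] == y[j]
    for every (i, j) in constraints contribute.
    """
    total = Fraction(0)
    for y in product((1, 2), repeat=n):
        if any(y[i] != y[j] for i, j in constraints):
            continue                                  # violates a pairwise constraint
        cut = sum(y[v] != y[v + 1] for v in range(n - 1))
        total += Fraction(1, 2 ** (beta * cut))
    return total
```

For instance, on a three-vertex path with the single constraint that the two endpoints agree, the four admissible labelings contribute 1, 1/4, 1/4 and 1.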

3.2 Similarity prediction on graphs 

In Algorithm 1 we give a simple application of the Matrix Winnow (superscript "w") and Perceptron
(superscript "p") algorithms to similarity prediction on graphs. The key aspect of the construction
(common to many methods in metric learning) is the creation of rank one matrices which correspond
to similarity "instances" (see (3)). We then use the standard analysis of the Perceptron [41]
and Matrix Winnow [47] algorithms with appropriate thresholds to obtain Proposition 2. A key
observation is that the squared Frobenius norm of the (un-normalized) instance matrices is bounded
by the squared resistance diameter of the graph, and the squared Frobenius norm of the (un-normalized)
"comparator" matrix is bounded by the cut-size squared $|\Phi^G|^2$.

Proposition 2. Let (G, y) be a labeled graph and let $\Psi$ be the (transposed) incidence matrix
associated with the Laplacian of G. Then, if we run the Matrix Winnow and Perceptron algorithms
with similarity instances constructed from $\Psi$, we have the following mistake bounds:

$$M^w = O\Big(|\Phi^G| \max_{(i,j) \in V^2} R^G_{i,j} \log n\Big) \quad\text{and}\quad M^p = O\Big(|\Phi^G|^2 \max_{(i,j) \in V^2} (R^G_{i,j})^2\Big).$$
Algorithm 1: Perceptron and Matrix Winnow algorithms on a graph
Input: Graph G = (V, E), |V| = n, with Laplacian $L = \Psi^T \Psi$, and $R := \max_{(i,j) \in V^2} R^G_{i,j}$;
Parameters: Perceptron threshold 9 P = R 2 ; Winnow threshold 9 W = eV \- v ny$>G\ > 
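Winnow learning rate $\eta = 1.28$;

Initialization: $W^p_0 = 0 \in \mathbb{R}^{m \times m}$; $W^w_0 = \frac{1}{m} I \in \mathbb{R}^{m \times m}$.
For t = 1, 2, ..., T:

• Get pair of vertices $(i_t, j_t) \in V^2$, and construct similarity instances,

• Predict: $\hat{y}^p_t = [\mathrm{tr}((W^p_{t-1})^T X^p_t) > \theta^p]$; $\hat{y}^w_t = [\mathrm{tr}((W^w_{t-1})^T X^w_t) > \theta^w]$;

• Observe $y_t \in \{0, 1\}$ and, if mistake ($y_t \ne \hat{y}_t$), update

$W^p_t \leftarrow W^p_{t-1} + (y_t - \hat{y}^p_t)\, X^p_t$; $\log W^w_t \leftarrow \log W^w_{t-1} + \eta\, (y_t - \hat{y}^w_t)\, X^w_t$.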



A severe drawback of both these algorithms is that on a generic graph, initialization requires
computing a pseudo-inverse (typically cubic time), and furthermore the update of Matrix Winnow
requires a cubic-time computation of an eigendecomposition (to compute matrix exponentials)
on each mistaken trial.² In the following section, we focus on a construction based on a graph
approximation for which we develop an efficient implementation of the Perceptron algorithm which
will require only poly-logarithmic time per round.



4 Efficient similarity prediction on graphs 

Relying on the notation of Section 3, we turn to efficient similarity prediction on graphs. We
present adaptations of Matrix Winnow and Matrix Perceptron to the case when the original graph
G is sparsified through a linearized and rebalanced random spanning tree of G. This sparsification
technique, called Binary Support Tree (BST) in [29], brings the twofold advantage of yielding
improved mistake bounds and faster prediction algorithms. More specifically, the use of a BST
replaces the (perhaps very large) resistance diameter term $\max_{(i,j) \in V^2} R^G_{i,j}$ in the mistake bounds
of Proposition 2 by a logarithmic term, the other term in the mistake bound becoming (when
dealing with the expected number of mistakes) only a logarithmic factor larger than the (often far
smaller) sum of the resistance-weighted cut-sizes in a spanning tree. Moreover, when combined with
the Perceptron algorithm, a BST allows us to develop a very fast implementation whose running
time per round is poly-logarithmic in n, rather than cubic, as in Matrix Winnow-like algorithms.

Recall that a uniformly random spanning tree of an unweighted graph can be sampled in
expected time O(n ln n) for "most" graphs [5]. Using the nice algorithm of [48], the expected
time reduces to O(n); see also the work of pp. However, all known techniques take expected time
$O(n^3)$ in certain pathological cases.

In a nutshell, a BST B of G is a full balanced binary tree whose leaves correspond to the vertices
in³ V. In order to construct B from G, we first extract a random spanning tree T of G, then we
visit T through a depth-first visit, order its vertices according to this visit eliminating duplicates
(thereby obtaining a path graph P), and finally we build B on top of P. Since B has 2n − 1 vertices,
we extend the class labels from leaves to internal vertices by letting, for each internal vertex i of B,
$y_i$ be equal to the class label of i's left child. Figure 1 illustrates the process. A simple adaptation
of [29] (Section 6 therein) shows that for any class k = 1, ..., K we have $|\Phi^B_k| \le 2\,|\Phi^P_k| \log_2 n$. With
the above in hand, we can prove the following bounds (see Appendix A.3 for further details).

Figure 1: From left to right: The graph G; a random spanning tree T of G (note that the vertices are
numbered by the order of a depth-first visit of T, starting from root vertex 1); the path graph P which follows
the order of the depth-first visit of T; the BST built on top of P. Notice how the class labels of the 8 vertices
in V (corresponding to the three colors) are propagated upwards.

2 Additionally, there is a tuning issue related to Matrix Winnow, since the threshold $\theta^w$ depends on the (unknown)
cut-size.



Theorem 3. Let (G, y) be a labeled graph, T be a random spanning tree of G, B be the corresponding
BST, and $\Psi_B$ be the (transposed) incidence matrix associated with B.

1. If we run Matrix Winnow with similarity instances constructed from $\Psi_B$ (see Algorithm 1),
then the expected number of mistakes E[M] on G satisfies $E[M] = O(\phi \log^3 n)$,

2. and if we run the Matrix Perceptron algorithm with similarity instances constructed from $\Psi_B$
then $E[M] = O(\phi^2 \log^4 n)$,

where we denote the resistance-weighted cut-size as $\phi = E[|\Phi^T|] = \sum_{(i,j) \in \Phi^G} R^G_{i,j}$.

The bound for Matrix Winnow is optimal up to a $\log^3 n$ factor (compare to the lower bound
in Proposition 1). However, this tight bound is obtained at the cost of having an algorithm which
is $O(n^3)$ per round, even when run on a tree. This is because matrix exponentials require storing
and updating a full SVD of the algorithm's weight matrix at each round, thereby making this
algorithm highly impractical when G is large. On the other hand, the Perceptron bound is
significantly suboptimal (due to its dependence on the squared resistance-weighted cut-size), but it
has the invaluable advantage of lending itself to a very efficient implementation: Whereas a naive
implementation would lead to an $O(n^2)$ running time per round, we now show that a more
involved implementation exists which takes only $O(\log^2 n)$, yielding an exponential improvement in
the per-round running time.
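The BST construction described in this section (spanning tree, depth-first linearization, balanced binary tree with labels copied from left children) can be sketched as follows. This is a minimal illustration under stated assumptions: the spanning tree is given as an adjacency dictionary, n is already a power of 2, and the BST is stored in a heap-style array where node v has children 2v and 2v+1. These layout choices and function names are assumptions of the example, not taken from the paper.

```python
def dfs_order(tree_adj, root):
    """Vertices of a spanning tree in order of first visit by a depth-first search."""
    order, seen, stack = [], set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Push neighbours in reverse so that lower-numbered ones are visited first.
        for u in sorted(tree_adj[v], reverse=True):
            if u not in seen:
                stack.append(u)
    return order

def build_bst_labels(path_order, labels):
    """Labels of a heap-indexed BST built on top of the path.

    Leaves occupy indices n..2n-1 in path order; every internal node copies
    the class label of its left child. Index 0 is unused.
    """
    n = len(path_order)
    assert n > 0 and (n & (n - 1)) == 0, "pad with dummy leaves so n is a power of 2"
    node_label = [None] * (2 * n)
    for k, v in enumerate(path_order):
        node_label[n + k] = labels[v]
    for v in range(n - 1, 0, -1):
        node_label[v] = node_label[2 * v]
    return node_label

# Hypothetical 4-vertex spanning tree and 2-class labeling.
tree_adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
labels = {1: "a", 2: "a", 3: "b", 4: "b"}
order = dfs_order(tree_adj, root=1)
bst = build_bst_labels(order, labels)
```

The heap layout keeps parent and child lookups to simple index arithmetic, which is convenient when paths between leaves have to be traced repeatedly.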



4.1 Implementing Matrix Perceptron on BST 

The algorithm operates on the BST B by maintaining a (2n − 1) × (2n − 1) symmetric matrix
F with integer entries initially set to zero. At time t, when receiving the pair of leaves $(i_t, j_t)$,
the algorithm constructs $\mathcal{P}_t$, the (unique) path in B connecting $i_t$ to $j_t$. Then the prediction
$\hat{y}_{i_t,j_t} \in \{0, 1\}$ is computed as

$$\hat{y}_{i_t,j_t} = \begin{cases} 1 & \text{if } \sum_{\ell,\ell' \in \mathcal{P}_t} F_{\ell,\ell'} > 4 \log^2 n, \\ 0 & \text{otherwise}. \end{cases} \tag{4}$$

Upon receiving label $y_{i_t,j_t}$, the algorithm updates F as follows. First of all, the algorithm is mistake
driven, so an update takes place only if $y_{i_t,j_t} \ne \hat{y}_{i_t,j_t}$. Let $\mathcal{N}_t$ be the set of neighbors of the vertices
in $\mathcal{P}_t$, and define $\mathcal{S}_t := \mathcal{N}_t \setminus (\mathcal{P}_t \setminus \{i_t, j_t\})$. We recursively assign integer tags $f_t(\ell)$ to vertices $\ell \in \mathcal{N}_t$
as follows: 1. For all $\ell \in \mathcal{P}_t$, if $\ell$ is the s-th vertex in $\mathcal{P}_t$ then we set $f_t(\ell) = s$; 2. For all $\ell \in \mathcal{N}_t \setminus \mathcal{P}_t$,
let $n_\ell$ be the (unique) neighbor of $\ell$ that is contained in $\mathcal{P}_t$. Then we set $f_t(\ell) = f_t(n_\ell)$. We then
update F on each pair $(\ell, \ell') \in \mathcal{S}_t^2$ as

$$F_{\ell,\ell'} \leftarrow F_{\ell,\ell'} + (2 y_{i_t,j_t} - 1)\,(f_t(\ell) - f_t(\ell'))^2. \tag{5}$$

Figure 2 illustrates the process. The following theorem is the main technical result of this section.
Its involved proof is given in Appendix A.3.

Theorem 4. Let B be a BST of a labeled graph (G, y) with |V| = n. Then the algorithm described
by (4) and (5) is equivalent to Matrix Perceptron run with similarity instances constructed from
$\Psi_B$. Moreover, the algorithm takes $O(\log^2 n)$ per trial, and there exists an adaptive representation
of F with an initialisation time of only O(n) (rather than $O(n^2)$).

Figure 2: Matrix Perceptron algorithm at time t with $i_t = 2$ and $j_t = 5$. Left: The BST. Light blue
vertices are those in $\mathcal{P}_t$. Thick-bordered vertices are those in $\mathcal{S}_t$. The $f_t$ tags are the red numbers near the
involved vertices (i.e., those in $\mathcal{P}_t \cup \mathcal{S}_t$). Middle: The matrix F. In light blue are the entries of F that are
summed over in (4). Right: The matrix F, where numbers are the values $(f_t(\ell) - f_t(\ell'))^2$ that are added
to ($y_{i_t,j_t} = 1$, $\hat{y}_{i_t,j_t} = 0$) or subtracted from ($y_{i_t,j_t} = 0$, $\hat{y}_{i_t,j_t} = 1$) the respective components of F during the
update step (5).

3 We assume w.l.o.g. that n = |V| is a power of 2. Otherwise, we may add dummy "leaves".
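The tag-based bookkeeping of this subsection can be sketched with a sparse dictionary standing in for the matrix F. Several choices here are assumptions of the sketch rather than the paper's specification: the heap indexing of the BST (node v has children 2v and 2v+1, leaf k is node n+k), storing each symmetric entry once under the key (u, v) with u < v, summing each unordered pair once in the score, and the threshold 4 log₂² n. The dictionary is only a simple stand-in for the adaptive representation mentioned in the theorem.

```python
import math

class BSTPerceptron:
    """Sparse sketch of the tag-based Perceptron update on a heap-indexed BST."""

    def __init__(self, n: int):
        assert n >= 2 and (n & (n - 1)) == 0, "n must be a power of 2"
        self.n = n
        self.F = {}                                   # (u, v) with u < v -> integer entry
        self.theta = 4 * math.log2(n) ** 2            # assumed threshold

    def _path(self, a, b):
        """Unique path between heap nodes a and b, listed from a to b."""
        up_a, up_b = [a], [b]
        while a != b:                                 # climb from the larger index
            if a > b:
                a //= 2
                up_a.append(a)
            else:
                b //= 2
                up_b.append(b)
        return up_a + up_b[-2::-1]                    # drop the duplicated meeting node

    def _neighbors(self, v):
        out = [v // 2] if v > 1 else []
        if v < self.n:                                # internal node: add both children
            out += [2 * v, 2 * v + 1]
        return out

    def score(self, i, j):
        path = self._path(self.n + i, self.n + j)
        return sum(self.F.get((u, v), 0) for u in path for v in path if u < v)

    def predict(self, i, j):
        return 1 if self.score(i, j) > self.theta else 0

    def update(self, i, j, y):
        """Mistake-driven update for distinct leaves i != j; True if F changed."""
        if self.predict(i, j) == y:
            return False
        path = self._path(self.n + i, self.n + j)
        tag = {v: s for s, v in enumerate(path, start=1)}
        for v in path:
            for u in self._neighbors(v):
                if u not in tag:                      # off-path neighbour inherits the tag
                    tag[u] = tag[v]
        inner = set(path[1:-1])
        S = sorted(v for v in tag if v not in inner)  # off-path neighbours + endpoints
        sign = 2 * y - 1                              # +1 if dissimilar, -1 if similar
        for a, u in enumerate(S):
            for v in S[a + 1:]:
                d = (tag[u] - tag[v]) ** 2
                if d:
                    self.F[(u, v)] = self.F.get((u, v), 0) + sign * d
        return True
```

Each round touches only the path between two leaves and its neighbours, so the number of entries read or written is quadratic in the path length, which is logarithmic in n. In the small check used to test this sketch, after a single update the score of a query pair equals the squared number of edges its path shares with the updated path.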



5 The Unknown Graph Case 

We now consider the case when the graph G = (V, E) is unknown to the learner beforehand. The
graph structure is thus revealed incrementally as more and more pairs $(i_t, j_t)$ get produced by the
adversary. A reasonable online protocol that includes progressive graph disclosure is the following.
At the beginning of round t = 1 the learner knows nothing about G except the number of vertices n
(prior knowledge of n makes presentation easier, but could easily be removed from this setting).
In the generic round t, the adversary presents to the learner both pair $(i_t, j_t) \in V \times V$ and a path
within G from $i_t$ to $j_t$. The learner is then compelled to predict whether or not the two vertices are
similar. Notice that, although the presented path may have cut-edges, there might be alternative
paths in G connecting the two vertices with no cut-edges. The learner need not see them. The
adversary then reveals the similarity label $y_{i_t,j_t}$ in G, and the next round begins. In this setting,
the adversary has complete knowledge of G, and can decide to produce paths and place the
cut-edges in an adaptive fashion. Notice that, because of the incremental disclosure of G, no such
constructions as $\Psi$-based similarity instances and/or BST, as contained in Section 3.2 and Section 4,
are immediately applicable.

As a simple warm-up, consider the case when $G$ is a tree and the $K$ class label sets of vertices
(henceforth called clusters) correspond to connected components of $G$. Figure 3 (a) gives
an example. Since the graph is a tree, the number of cut-edges equals $K - 1$. We can associate
with such a tree a linear-threshold function vector $u = (u_1, \ldots, u_{n-1})^\top \in \{0,1\}^{n-1}$, where $u_i$ is
1 if and only if the $i$-th edge is a cut-edge. The ordering of edges within $u$ can be determined
ex-post by first disclosure times. For instance, if in round $t = 1$ the adversary produces pair
$(6,4)$ and path $6 \to 3 \to 1 \to 4$ (Figure 3 (a)), then edge $(6,3)$ will be the first edge, $(3,1)$ will
be the second, and $(1,4)$ will be the third. Then, if in round $t = 2$ the new pair is $(3,5)$ and
the associated path is $3 \to 1 \to 5$, the newly revealed edge $(1,5)$ will be the fourth edge within
$u$. With this ordering in mind, the algorithm builds at time $t$ the $(n-1)$-dimensional vector
$x_t = (x_{1,t}, \ldots, x_{n-1,t})^\top \in \{0,1\}^{n-1}$ corresponding to the path disclosed at time $t$, where $x_{i,t}$ is
1 if and only if the $i$-th edge belongs to the path. Now, it is clear that $y_{i_t,j_t} = 1$ if $u^\top x_t \ge 1$,
and $y_{i_t,j_t} = 0$ if $u^\top x_t = 0$. Therefore, this turns out to be a sparse linear-threshold function
learning problem, and a simple application of the standard Winnow algorithm [36] leads to an
$O(K \log n) = O(|\Phi^G| \log n)$ mistake bound obtained by an efficient ($O(n)$ time per round) algorithm,
independent of the structural properties of $G$, such as its diameter.
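The warm-up can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TreePathWinnow`, the doubling/halving update and the threshold equal to the dimension $n - 1$ are one common textbook variant of Winnow, not necessarily the exact variant of [36]. Edges are indexed by first disclosure time, as in the text.

```python
class TreePathWinnow:
    """Winnow-style learner over edge-indicator vectors of a tree (illustrative sketch)."""

    def __init__(self, n):
        self.dim = n - 1                 # a tree on n vertices has n - 1 edges
        self.w = [1.0] * self.dim        # multiplicative weights, one per edge
        self.theta = float(self.dim)     # assumed threshold: the dimension
        self.edge_index = {}             # edge -> position in u, by first disclosure

    def _active(self, path):
        # Indices i with x_{i,t} = 1, i.e. the edges lying on the disclosed path.
        active = []
        for a, b in zip(path, path[1:]):
            e = frozenset((a, b))
            if e not in self.edge_index:
                self.edge_index[e] = len(self.edge_index)
            active.append(self.edge_index[e])
        return active

    def predict(self, path):
        active = self._active(path)
        score = sum(self.w[i] for i in active)
        return (1 if score >= self.theta else 0), active

    def update(self, active, y_hat, y):
        # Promote on a missed cut-edge (y = 1), demote on a false alarm (y = 0).
        if y_hat != y:
            factor = 2.0 if y == 1 else 0.5
            for i in active:
                self.w[i] *= factor
```

With the paths of Figure 3 (a), the first path activates coordinates 0, 1, 2, and the second path reuses edge $(3,1)$ and adds the new edge $(1,5)$ as coordinate 3.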

One might wonder if an adaptation of the above procedure exists which applies to a general 
graph G by, say, extracting a spanning tree T out of G, and then applying the Winnow algorithm 
on T. Unfortunately, the answer is negative for at least two reasons. First, the above linear- 
threshold model heavily relies on the fact that clusters are connected, which need not be the case 
in our similarity problem. More critically, even if the clusters are connected in G, they need not 
be connected in T. Figure [3] (b)-(c) shows a typical example where Winnow applied to a spanning 
tree fails. Given this state of affairs, we are led to consider a slightly different representation for
pairs of vertices and paths. Yet, as before, this representation will suggest a linear separability 
condition, as well as the deployment of appropriate linear-threshold algorithms. 

5.1 Algorithm and analysis 

Algorithm 2 contains the pseudocode of our algorithm. When interpreted as operating on vectors,
the algorithm is simply an $r$-norm Perceptron algorithm [20, 17] with nonzero threshold, and norm
$r = 2\log (n-1)^2 = 4\log(n-1)$, with $(n-1)^2$ being the length of the vectors maintained throughout,
and $s$ the dual to norm $r$. At time $t$, the algorithm observes pair $(i_t, j_t)$ and path $p_{(i_t \to j_t)}$, builds
the instance vector $x_t \in \{-1, 0, 1\}^{n-1}$ and the long vector $\mathrm{vec}(X_t)$ out of the rank-one matrix
$X_t = x_t x_t^\top$, where $\mathrm{vec}(\cdot)$ is the standard vectorization of a matrix that stacks its columns one
underneath the other. In order to construct $x_t$ from $p_{(i_t \to j_t)}$, the algorithm maintains a forest
made up of the union of paths seen so far. If pair $(i_t, j_t)$ is already connected by a path $p$ in the
current forest, then $x_t$ is the instance vector associated with path $p$ (as for the Winnow algorithm on

Figure 3: (a) A tree with 3 clusters corresponding to the 3 depicted connected components. Edges $e_2$ and
$e_5$ are the cut-edges. Assuming edges are initially revealed in the order of their subscripts, the associated
vector $u$ is $u = (0, 1, 0, 0, 1, 0)^\top$. The algorithm receives at time $t = 1$ the pair $(6,4)$ along with path
$6 \to 3 \to 1 \to 4$ (so that the 3 thick edges are revealed in the first round). The associated feature vector is
$x_1 = (1, 1, 1, 0, 0, 0)^\top$. Vertices 6 and 4 are disconnected as $u^\top x_1 \ge 1$. (b)-(c) The connectivity structure
induced by the thick-edged spanning tree on the blue cut in (b). Vertices 4, 5, 6, and 7 are all connected in
$G$ under the blue cut (b), but they are all disconnected in $T$ (c).

Algorithm 2: r-norm Perceptron for similarity prediction in unknown graphs.

Input: Number of vertices $n = |V|$, $V = \{1, \ldots, n\}$, $n > 3$;

Initialization: $w_0 = 0 \in \mathbb{R}^{(n-1)^2}$; $r = 4\log(n-1)$, $s = \frac{r}{r-1}$;

For $t = 1, 2, \ldots, T$:

• Get pair of vertices $(i_t, j_t) \in V^2$, and path $p_{(i_t \to j_t)}$. Construct instance vector
$x_t \in \{-1, 0, 1\}^{n-1}$ as explained in the main text;

• Build $(n-1)^2$-dimensional vector $\mathrm{vec}(X_t)$, where $X_t = x_t x_t^\top$, and predict $\hat{y}_t \in \{0, 1\}$ as

    $\hat{y}_t = 1$ if $w_{t-1}^\top \mathrm{vec}(X_t) > (r-1)\|x_t\|_r^4$, and $\hat{y}_t = 0$ otherwise;

• Observe $y_t \in \{0, 1\}$ and, if mistake ($\hat{y}_t \ne y_t$), update

    $f(w_t) \leftarrow f(w_{t-1}) + (y_t - \hat{y}_t)\,\mathrm{vec}(X_t)$,   $f(w) = \nabla \|w\|_s^2 / 2$.



a tree in the previous section, but taking edge orientations into account; see Figure 7 in Appendix A.4
for details). Otherwise, path $p_{(i_t \to j_t)}$ is added to the forest and $x_t$ will be an instance vector
associated with the new path $p_{(i_t \to j_t)}$. In adding the new path to the forest, we need to make
sure that no circuits are generated. In particular, as soon as a revealed edge in a path causes two
subtrees to join, the algorithm merges the two subtrees and processes all remaining edges in that
path in a sequential manner so as to avoid generating circuits. The algorithm will end up using a
spanning tree $T$ of $G$ for building its instance vectors $x_t$. This spanning tree is determined on the
fly by the adversarial choices of pairs and paths, so it is not known to the algorithm ahead of time
(see footnote 4). But any later change to the spanning forest is designed so as to keep consistency with
all previous vectors $x_t$.
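The circuit-avoidance bookkeeping described above can be sketched with a disjoint-set (union-find) structure. This is a minimal illustration, not the paper's implementation: the class `PathForest` and its methods are hypothetical names, and the sketch only decides which revealed edges enter the forest; it does not build the oriented instance vectors $x_t$.

```python
class PathForest:
    """Union-find over vertices 1..n that records the forest edges accepted so far."""

    def __init__(self, n):
        self.parent = list(range(n + 1))   # index 0 is unused
        self.edges = []                    # accepted edges, in order of first disclosure

    def find(self, v):
        # Path halving keeps the union-find trees shallow.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def connected(self, a, b):
        return self.find(a) == self.find(b)

    def add_path(self, path):
        """Process the path edge by edge, skipping any edge that would close a circuit."""
        added = []
        for a, b in zip(path, path[1:]):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:                   # the edge joins two subtrees: merge them
                self.parent[ra] = rb
                self.edges.append((a, b))
                added.append((a, b))
        return added
```

For example, after the paths $6 \to 3 \to 1 \to 4$ and $3 \to 1 \to 5$ have been processed, a later path made of the single edge $(6, 4)$ adds nothing, since vertices 6 and 4 are already connected in the forest.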

4 In fact, because the algorithm is deterministic, this spanning tree is fully determined by the adversary. We are
currently exploring to what extent randomization is beneficial for an algorithm in this setting.

The decision threshold $(r-1)\|x_t\|_r^4 = (r-1)\|\mathrm{vec}(X_t)\|_r^2$ follows from a standard analysis of
the $r$-norm Perceptron algorithm with nonzero threshold (easily adapted from [20, 17]), as does
the update rule. In short, since the graph is initially unknown, the algorithm is pretending to
learn vectors rather than (Laplacian-regularized) matrices, and relies on a regularization that takes
advantage of the sparsity of such vectors. The analysis of Theorem 5 below rests on ancillary (and
classical) properties of matroids on graphs. These are recalled in Appendix A.4 before the proof
of the theorem.
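The prediction and update steps of Algorithm 2 can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: it stores the dual vector $\theta_t = f(w_t)$ and recovers $w_t$ through the inverse link $f^{-1}(\theta) = \nabla \|\theta\|_r^2 / 2$, takes $\log$ to be the natural logarithm, and expects the instance vector $x_t$ to be supplied by the caller. The names `link_inverse` and `RNormPerceptron` are hypothetical.

```python
import math

def link_inverse(theta, r):
    """Map the dual vector theta = f(w) back to w = grad(||theta||_r^2 / 2)."""
    norm = sum(abs(v) ** r for v in theta) ** (1.0 / r)
    if norm == 0.0:
        return [0.0] * len(theta)
    return [math.copysign(abs(v) ** (r - 1), v) / norm ** (r - 2) for v in theta]

class RNormPerceptron:
    def __init__(self, n):
        self.r = 4.0 * math.log(n - 1)            # assumption: natural logarithm
        self.theta = [0.0] * ((n - 1) ** 2)       # dual weights f(w_t), initially zero

    def predict(self, x):
        vec_x = [a * b for b in x for a in x]     # vec(x x^T), columns stacked
        w = link_inverse(self.theta, self.r)
        margin = sum(wi * vi for wi, vi in zip(w, vec_x))
        x_norm_r = sum(abs(a) ** self.r for a in x) ** (1.0 / self.r)
        threshold = (self.r - 1.0) * x_norm_r ** 4
        return (1 if margin > threshold else 0), vec_x

    def update(self, vec_x, y_hat, y):
        # Additive update in the dual space, applied only on mistaken rounds.
        if y_hat != y:
            for k, v in enumerate(vec_x):
                self.theta[k] += (y - y_hat) * v
```

If the same dissimilar instance is presented repeatedly, the margin after $k$ updates is $k\|\mathrm{vec}(X_t)\|_r^2$, so the sketch switches its prediction to 1 once $k$ exceeds $r - 1$.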

Theorem 5. With the notation introduced in this section, let Algorithm 2 be run on an arbitrary
sequence of pairs $(i_1, j_1), (i_2, j_2), \ldots$ and associated sequence of paths $p_{(i_1 \to j_1)}, p_{(i_2 \to j_2)}, \ldots$. Then
we have the mistake bound $M = O(|\Phi^G|^4 \log n)$.

Remark 1. As explained in the proof of Theorem 5, the separability condition established there allows one to
run any vector or matrix mirror descent linear-threshold algorithm. In particular, since matrix $U$
therein is spectrally sparse (rank $K \ll n$), one could use unitarily invariant regularization methods,
like (squared) trace norm-based online algorithms (e.g., [41, 10]). For instance, Matrix Winnow
(more generally, Matrix EG-like algorithms [40]) would get bounds which are linear in the cutsize
but also (due to their unitary invariance) linear in $\|x_t\|_2^2$. The latter can be as large as the diameter
of $T$, which can easily be $O(n)$ even if the diameter of $G$ is much smaller. This makes these bounds
significantly worse than Theorem 5 when the total cutsize $|\Phi^G|$ is small compared to $n$ (which is
our underlying assumption throughout). Group norm regularizers can also be used. Yet, because
$X_t$ has rank one, when $|\Phi^G|$ is small these regularizers do not lead to better bounds than Theorem 5.
Moreover, it is worth mentioning that, among the standard mirror descent linear-threshold
algorithms operating on vectors $\mathrm{vec}(\cdot)$, our choice of the $r$-norm Perceptron is motivated by the
fact that this algorithm achieves a logarithmic bound in $n$ with no prior knowledge of the actual cutsize
(or an upper bound thereof); see Section 3.2, and the discussion in [11] about tuning of
parameters in $r$-norm Perceptron and Winnow/Weighted Majority-like algorithms.

As a final remark, our algorithm has an $O(n^2)$ running time per round, trivially due to the
update rule operating on $O(n^2)$-long vectors. The construction of instance vector $x_t$ out of path
$p_{(i_t \to j_t)}$ can indeed be implemented faster than $\Theta(n^2)$ by maintaining well-known data structures
for disjoint sets (e.g., [14, Ch. 22]).

A Proofs

This appendix contains all omitted proofs. Notation is as in the main text. 
A.1 Missing proofs from Section 2

The set of example sequences consistent with a concept $f$ for class prediction is denoted by
$S_c(f) := (\{(x, f(x))\}_{x \in \mathcal{X}})^*$, and for similarity prediction by
$S_s(f) := (\{((x', x''), \mathrm{sim}(f(x'), f(x'')))\}_{x', x'' \in \mathcal{X}})^*$.
A prediction algorithm is a mapping $A : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}^{\mathcal{X}}$ from example sequences to prediction
functions. Thus if $A$ is a prediction algorithm and $S = (x_1, y_1), \ldots, (x_T, y_T) \in (\mathcal{X} \times \mathcal{Y})^*$ is an
example sequence, then the online prediction mistakes are

$M_A(S) := \sum_{t=1}^{T} [A((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}))(x_t) \ne y_t]$.

We write $S' \subseteq S$ to denote that $S'$ is a subset of $S$, as well as to denote that $S'$ is a subsequence of $S$.

We now introduce a weaker notion of a mistake bound, defined with respect to specific
sequences rather than to the alternate notion of a concept. The weakness of this definition allows
the construction in the following lemma to apply to "noisy" as well as "consistent" example sequences.

Definition 6. Given an algorithm $A$, we define the subsequential mistake bound with respect
to an example sequence $S$ as

$B^\circ_A(S) := \max_{S' \subseteq S} M_A(S')$.

Thus, a subsequential mistake bound is simply the "worst-case" mistake bound over all subsequences.
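The two quantities just defined can be computed by brute force on toy sequences. The sketch below is purely illustrative: `memorizer` is a hypothetical stand-in for a prediction algorithm $A$ (a map from a history to a prediction function), and the enumeration over all subsequences is exponential, so it is only meant for very short sequences.

```python
from itertools import combinations

def mistakes(algorithm, sequence):
    """M_A(S): online mistakes of `algorithm` on the example sequence S."""
    history, count = [], 0
    for x, y in sequence:
        if algorithm(tuple(history))(x) != y:
            count += 1
        history.append((x, y))
    return count

def subsequential_bound(algorithm, sequence):
    """Worst-case mistakes over all subsequences of S (exponential enumeration)."""
    best = 0
    for k in range(len(sequence) + 1):
        for idx in combinations(range(len(sequence)), k):
            best = max(best, mistakes(algorithm, [sequence[i] for i in idx]))
    return best

def memorizer(history):
    # Hypothetical base algorithm: repeat the last label seen for x, else predict 0.
    table = dict(history)
    return lambda x: table.get(x, 0)
```

On a consistent sequence the memorizer errs only on the first occurrence of a pattern whose label differs from its default, while on a "noisy" sequence with contradicting labels it can err on every example.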

Lemma 7. Given an online classification algorithm $A$, there exists a similarity algorithm $A'$ such
that for every sequence $S = (x_1, y_1), \ldots, (x_{2T}, y_{2T})$ the number of mistakes of $A'$ on

$S' = ((x_1, x_2), \mathrm{sim}(y_1, y_2)), \ldots, ((x_{2T-1}, x_{2T}), \mathrm{sim}(y_{2T-1}, y_{2T}))$

is bounded as

$M_{A'}(S') \le c\, B^\circ_A(S) \log_2 K$,   (6)

with $c < 5$.

Proof. The proof of (6) works by a modification of the standard weighted majority algorithm [37]
arguments. The key idea is that similarity reduces to classification if we receive the actual class
labels as feedback rather than just similar/dissimilar as feedback. Since we do not have the class
labels, we instead "hallucinate" all possible feedback histories and then combine these histories on
each trial using a weighted majority vote. If we only keep track of histories generated when the
weighted majority vote is mistaken, the bound is small. Our master voting algorithm $A'$ follows.

1. Initialisation: We initialize the parameter $\beta = 0.294$. We create a pool containing example
sequences ("hallucinated histories") $\mathcal{S} := \{s\}$, with initially the empty history $s = ()$ with
weight $w_s := 1$.

2. For $t = 1, \ldots, T$ do

3. Receive: the pattern pair $(x_{2t-1}, x_{2t})$.

4. Predict: similar if

$\sum_{s \in \mathcal{S}} w_s [A(s)(x_{2t-1}) = A(s)(x_{2t})] > \sum_{s \in \mathcal{S}} w_s [A(s)(x_{2t-1}) \ne A(s)(x_{2t})]$,

otherwise predict dissimilar.

5. Receive: similarity feedback $\mathrm{sim}(y_{2t-1}, y_{2t})$. If the prediction was correct, go to 2.

6. Two cases. First, if this algorithm predicted SIMILAR when the pair was dissimilar, then for
each history $s \in \mathcal{S}$ with a mistaken prediction create $K \times (K-1)$ histories $s_{1,2}, \ldots, s_{K,K-1}$
so that $s_{i,j}$ is equal to $s$ but has the two "speculated" examples $(x_{2t-1}, i), (x_{2t}, j)$ appended
to it. Then set $w_{s_{1,2}} = \ldots = w_{s_{K,K-1}} := \frac{\beta}{K(K-1)} w_s$ and remove $s$ from $\mathcal{S}$. Second (predicted
dissimilar), proceed as above, but now we need to add only $K$ new histories to the pool.

7. Go to 2.

Observe that there exists a history $s^* \in \mathcal{S}$ generated by no more than $B^\circ_A(S)$ "mistakes", since there is
always at least one history in the pool $\mathcal{S}$ which is a subsequence of $S$. Thus

$w_{s^*} \ge \left( \frac{\beta}{K(K-1)} \right)^{B^\circ_A(S)}$.

Furthermore, the total weight of the pool of histories $W := \sum_{s \in \mathcal{S}} w_s$ is reduced to a fraction of
its weight no larger than $\frac{1+\beta}{2}$ whenever this master algorithm $A'$ makes a mistake. Thus, since
$W \ge w_{s^*}$, we have

$\left( \frac{1+\beta}{2} \right)^{M_{A'}(S')} \ge \left( \frac{\beta}{K(K-1)} \right)^{B^\circ_A(S)}$.

Solving for $M_{A'}(S')$ we have

$M_{A'}(S') \le B^\circ_A(S)\, \frac{\log_2 \frac{K(K-1)}{\beta}}{\log_2 \frac{2}{1+\beta}}$.

Substituting in $\beta = 0.294$ allows us to obtain the upper bound of (6) with $c \approx 4.99$. □
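The master voting algorithm can be sketched directly from the steps in the proof. The code below is an illustration under assumptions rather than a faithful reference implementation: histories are tuples of (pattern, label) pairs, `base` maps a history to a prediction function, the $K$ histories added in the "predicted dissimilar" case are assumed to share the weight $\beta w_s$ equally, and `memorizer` is a hypothetical base algorithm. The pool grows exponentially with the number of master mistakes, in line with the remark that the general construction is computationally infeasible.

```python
def similarity_master(base, pairs, K, beta=0.294):
    """Run the master algorithm on (x1, x2, similar) triples; return its mistake count."""
    pool = {(): 1.0}                        # hallucinated history -> weight
    mistakes = 0
    for x1, x2, similar in pairs:
        votes = {}
        for s in pool:
            h = base(s)
            votes[s] = (h(x1) == h(x2))     # does history s vote "similar"?
        w_sim = sum(w for s, w in pool.items() if votes[s])
        w_dis = sum(w for s, w in pool.items() if not votes[s])
        predicted = w_sim > w_dis
        if predicted == similar:
            continue
        mistakes += 1
        if similar:                         # speculate equal labels: K continuations
            labelings = [(k, k) for k in range(1, K + 1)]
        else:                               # speculate distinct labels: K(K-1) continuations
            labelings = [(i, j) for i in range(1, K + 1)
                         for j in range(1, K + 1) if i != j]
        new_pool = {}
        for s, w in pool.items():
            if votes[s] == similar:         # history was right: keep it unchanged
                new_pool[s] = new_pool.get(s, 0.0) + w
                continue
            for i, j in labelings:          # history was wrong: replace by continuations
                child = s + ((x1, i), (x2, j))
                new_pool[child] = new_pool.get(child, 0.0) + beta * w / len(labelings)
        pool = new_pool
    return mistakes

def memorizer(history):
    # Hypothetical base classifier: last label seen for x, default class 1.
    table = dict(history)
    return lambda x: table.get(x, 1)
```

On a three-pattern toy problem with $K = 2$ (labels a: 1, b: 2, c: 1), the sketch makes two mistakes over two passes through the pairs (a, b), (a, c), (b, c).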

We observe that the reduction of similarity to classification holds for a wide variety of online mistake
bounds. Thus, e.g., we do not require the input sequence to be consistent, i.e., in $S$ we may have
examples $(x', y')$ and $(x'', y'')$ such that $x' = x''$ but $y' \ne y''$. The usual type of mistake bound
is permutation invariant, i.e., the bound is the "same" for all permutations of the input sequence;
typical examples include the Weighted Majority [37] and $p$-norm Perceptron [20, 17]
algorithms. Observe that if $B_A(S)$ is a permutation invariant bound, then $B^\circ_A(S) \le B_A(S)$, since
every subsequence of $S$ is the prefix of a permutation of $S$. However, our reduction also more
broadly applies to "order-dependent" bounds, such as the shifting-expert bounds in [26].

We now show that classification reduces to similarity. This reduction is efficient and does not
introduce a multiplicative constant but requires the stronger assumption of consistency not required
by Lemma 7.

Lemma 8. Given an online similarity algorithm $A_s$ there exists an online classification algorithm
$A_c$ such that for any concept $f$, if $S \in S_c(f)$ then

$M_{A_c}(S) \le \max_{S' \in S_s(f)} M_{A_s}(S') + K$.   (7)

Proof. As a warm-up, pretend we know a set $P \subseteq \mathcal{X}$ such that $|P| = K$ and for each $i \in \{1, \ldots, K\}$
there exists an $x \in P$ such that $f(x) = i$. Using $P$ we create algorithm $A_c$ as follows. We maintain
a history (example sequence) $h$, which is initially empty. Then on every trial when we receive
a pattern $x_t$ we predict $\hat{y}_t \in \{f(x) : A_s(h)((x, x_t)) = \text{similar},\ x \in P\}$, and if the set contains
multiple elements or is empty then we predict arbitrarily. If $A_c$ incurs a mistake, we add to our
history $h$ the $K$ examples $((x, x_t), \mathrm{sim}(f(x), f(x_t)))_{x \in P}$. Observe that if $A_c$ incurs a mistake then
at least one example corresponding to a mistaken similarity prediction is added to $h$ and necessarily
$h \in S_s(f)$. Thus the mistakes of $A_c$ are bounded by $\max_{S' \in S_s(f)} M_{A_s}(S')$. Now, since we do not
actually know a set $P$, we may modify our algorithm $A_c$ so that although $P$ is initially empty we
predict as before, and if we make a mistake on $x_t$ because there does not exist an $x \in P$ such that
$f(x) = f(x_t)$, we then add $x_t$ to $P$. We can only make $K$ such mistakes, so we have the bound
of (7). □

Proof of Theorem 1. If $S$ is a classification sequence consistent with a concept $f$ and $A$ is
a classification algorithm, then $B^\circ_A(S) \le M_A(f)$, and hence (6) implies (1). Then, since (7) is
equivalent to (2), we are done. □



A.1.1 The log K term is necessary in Theorem 1



In our study of class prediction on graphs we observed (see Appendix A.2) that certain 2-class
bounds may be converted to $K$-class bounds with no explicit dependence on $K$. Yet Theorem 1
introduces a factor of $\log K$ for similarity prediction. So a question that arises is whether this is simply a
byproduct of the above analysis or whether the "$\log K$" factor is tight. In the following, we demonstrate that it is
tight by introducing the paired permutation problem, which may be "solved" in the classification
setting with no more than $O(K)$ mistakes. Conversely, we show that an adversary can force
$\Omega(K \log K)$ mistakes in the similarity setting.

We introduce the following notation. Let $z : \{1, \ldots, K\} \to \{1, \ldots, K\}$ denote a permutation
function, a member of the set $Z_K$ of all $K!$ bijective functions from $\{1, \ldots, K\}$ to $\{1, \ldots, K\}$.
A paired permutation function is the mapping $y_z : \{1, \ldots, K\}^2 \to \{1, \ldots, K\}$, with
$y_z(x', x'') := \max(z(x'), z(x''))$. So, for example, consider the 3-element permutation $z(1) = 2$, $z(2) = 3$, and
$z(3) = 1$. Then, e.g., $y_z(1, 2) = 3$ and $y_z(1, 1) = 2$. Thus, we define the set of the paired
permutation problem example sequences for class prediction as $PP_c := \bigcup_{z \in Z_K} S_c(y_z)$, and for
similarity as $PP_s := \bigcup_{z \in Z_K} S_s(y_z)$.

Theorem 9. There exists a class prediction algorithm $A$ such that for any $S \in PP_c$ we have
$M_A(S) = O(K)$. Furthermore, for any similarity prediction algorithm $A'$, there exists an $S' \in PP_s$
such that $M_{A'}(S') = \Omega(K \log K)$.

Proof of Theorem 9. First, we show that there exists a class prediction algorithm $A$ such that
for any $S \in PP_c$ we have $M_A(S) = O(K)$. Consider the simpler problem for the concept class
of permutations $\bigcup_{z \in Z_K} S_c(z)$. By simply predicting consistently with the past examples we cannot
incur more than $K - 1$ mistakes. The algorithm $A_0(s)$ (CONSISTENT PREDICTOR) predicts $y$ on
receipt of pattern $x_t$ if there exists some example $(x, y)$ in its history $s$ such that $x_t = x$, otherwise
it predicts a $y \in \{1, \ldots, K\}$ not in its history. Now, using $A_0$ as a base algorithm, we can use the
principle of the master algorithm of Lemma 7 to achieve $O(K)$ mistakes for the paired permutation
problem. Thus, when we receive a pair $((x', x''), y)$ either $z(x') = y$ or $z(x'') = y$, hence on
a mistake we may "hallucinate" these two possible continuations. Our class prediction example
sequence is $S = ((x'_1, x''_1), y_1), \ldots, ((x'_T, x''_T), y_T) \in PP_c$, and the algorithm $A$ follows.

1. Initialisation: We initialize the parameter $\beta = 0.294$. We create a pool containing example
sequences ("hallucinated histories") $\mathcal{S} := \{s\}$, with initially the empty history $s = ()$ with
weight $w_s := 1$.

2. For $t = 1, \ldots, T$ do

3. Receive: the pattern $x_t = (x'_t, x''_t)$.

4. Predict:

$\hat{y}_t = \arg\max_{k \in \{1, \ldots, K\}} \sum_{s \in \mathcal{S}} w_s [\max(A_0(s)(x'_t), A_0(s)(x''_t)) = k]$.   (8)

5. Receive: class feedback $y_t \in \{1, \ldots, K\}$. If the prediction was correct, go to 2.

6. For each history $s \in \mathcal{S}$ with a mistaken prediction, create two histories $s', s''$ so that $s'$ ($s''$)
is equal to $s$ but has the example $(x'_t, y_t)$ (the example $(x''_t, y_t)$) appended to it. Then set
$w_{s'} = w_{s''} := \frac{\beta}{2} w_s$, and remove $s$ from $\mathcal{S}$.

7. Go to 2.

Observe that there exists a history $s^* \in \mathcal{S}$ generated by no more than $K - 1$ "mistakes". This is because,
by induction, there is always a consistent history (i.e., the empty history is initially consistent,
and when the master algorithm $A$ makes a mistake and a consistent history makes a mistake,
then either the continuation $(x'_t, y_t)$ or $(x''_t, y_t)$ is consistent). Finally, observe that once a consistent
history contains $K - 1$ examples, it can no longer make mistakes. Thus

$w_{s^*} \ge \left( \frac{\beta}{2} \right)^{K-1}$.

Furthermore, the total weight of the pool of histories $W := \sum_{s \in \mathcal{S}} w_s$ is reduced to a fraction of its
weight no larger than $\frac{1+\beta}{2}$ whenever this master algorithm $A$ makes a mistake. Since $W \ge w_{s^*}$, we
have

$\left( \frac{1+\beta}{2} \right)^{M_A(S)} \ge \left( \frac{\beta}{2} \right)^{K-1}$.

Solving for $M_A(S)$ we can write

$M_A(S) \le (K-1)\, \frac{\log_2 \frac{2}{\beta}}{\log_2 \frac{2}{1+\beta}}$.

Substituting in $\beta = 0.294$ allows us to obtain the upper bound $M_A(S) \le c\,(K-1)$ with $c \approx 4.4$. The
argument then follows as in Theorem 1.
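As a quick numerical sanity check of the constant above, the ratio $\log_2(2/\beta) / \log_2(2/(1+\beta))$ can be evaluated directly. The helper name `constant` is ours; the comparison values 0.25 and 0.35 are arbitrary nearby choices used only to suggest that $\beta = 0.294$ sits close to the minimizer.

```python
import math

def constant(beta):
    """Multiplier of (K - 1) in the mistake bound, as a function of beta."""
    return math.log2(2.0 / beta) / math.log2(2.0 / (1.0 + beta))

c = constant(0.294)   # roughly 4.40
```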

Now consider the similarity problem. If we receive an instance of the form $(((x', x''), (x'', x'')), y)$
(with $x' \ne x''$) then $y = $ SIMILAR implies $z(x') < z(x'')$ and $y = $ DISSIMILAR implies $z(x') > z(x'')$.
Thus with each mistaken example we learn precisely a single '<' comparison. It follows from
standard lower bounds on comparison-based sorting algorithms (e.g., [2]) that an adversary can
force $\Omega(K \log K)$ comparisons, and thus mistakes, for any "comparison"-algorithm to learn an
arbitrary permutation. □
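The way a similarity label encodes a single comparison can be checked on the 3-element permutation used earlier. This is a small illustrative sketch; the function names are ours.

```python
def paired_permutation(z):
    """y_z(x', x'') = max(z(x'), z(x'')) for a permutation z given as a dict."""
    return lambda x1, x2: max(z[x1], z[x2])

def similarity_label(z, x1, x2):
    """Label of the similarity instance ((x1, x2), (x2, x2)) under y_z."""
    y = paired_permutation(z)
    return "similar" if y(x1, x2) == y(x2, x2) else "dissimilar"

z = {1: 2, 2: 3, 3: 1}
y_z = paired_permutation(z)
```

Here `similarity_label(z, 1, 2)` is "similar" because $z(1) < z(2)$, while `similarity_label(z, 2, 3)` is "dissimilar" because $z(2) > z(3)$.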

Any problem associated with a set of example sequences $\mathcal{S}$ may be iterated into a set of $r$
independent problems by a cross-product-like construction, so that if $S^1, \ldots, S^r \in \mathcal{S}$ and if
$S^i = (x^i_1, y^i_1), \ldots, (x^i_T, y^i_T)$ then an $r$-iterated example sequence is

$((x^1_1, 1), y^1_1), \ldots, ((x^1_T, 1), y^1_T), \ldots, ((x^r_1, r), y^r_1), \ldots, ((x^r_T, r), y^r_T)$.

We have simply conjoined the $r$ example sequences into a single example sequence with each pattern
"$x$" paired with an integer indicating from which sequence it originated. Thus by $r$-iterating the
paired permutation problem we trivially observe mistake bounds of $O(rK)$ and $\Omega(rK \log K)$ for
all $r \in \mathbb{N}$ in the class and similarity setting, respectively, thereby implying that the multiplicative
"$\log K$" gap occurs for an infinite family of classification/similarity problems.

A.2 Missing proofs from Section 3

Lifting 2-class prediction to K-class prediction on graphs 

Suppose we have an algorithm for the 2-class graph labeling problem with a mistake bound of the
form $M \le c|\Phi^G(y)|$ for all $y \in \{1, 2\}^n$, with $c > 0$. We show that this implies the existence in the
$K$-class setting of an algorithm with a bound of $M \le 2c|\Phi^G(y)|$ for all $y \in \{1, \ldots, K\}^n$, where $K$
need not be known in advance to the algorithm.

The algorithm simply works by combining the predictions of "one-versus-rest" classifiers. We
train one classifier per class, and introduce a new classifier as soon as that class first appears. On
any given trial, the combination is straightforward: If there is only one classifier predicting with
its own class then we go with that class, otherwise we just assume a mistake. Thus, on any given
trial, we can only be mistaken if one of the current "one-versus-rest" classifiers makes a mistake.
This implies that our mistake bound is the sum of the mistake bounds of all of the "one-versus-rest"
classifiers. Because each such binary classifier has a mistake bound of the form $M \le c|\Phi^G_k(y)|$, and
$\sum_{k=1}^{K} |\Phi^G_k(y)| = 2|\Phi^G(y)|$, we have that the $K$-class classifier has a bound of the form $2c|\Phi^G(y)|$.
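The one-versus-rest combination can be sketched as follows. This is a hedged illustration: the binary learner interface (`predict`, `update`, with label 1 meaning "own class" and 2 meaning "rest") and the stand-in `MemorizingBinary` are assumptions made for the example, and a newly introduced classifier is trained only from the trial on which its class first appears.

```python
class MemorizingBinary:
    """Hypothetical 2-class learner: remembers labels of seen vertices, defaults to 'rest'."""

    def __init__(self):
        self.seen = {}

    def predict(self, v):
        return self.seen.get(v, 2)

    def update(self, v, label):
        self.seen[v] = label

class OneVsRest:
    def __init__(self, make_binary):
        self.make_binary = make_binary
        self.learners = {}                 # class k -> binary learner for "k versus rest"

    def predict(self, v):
        claims = [k for k, b in self.learners.items() if b.predict(v) == 1]
        # Exactly one classifier claims v: go with it; otherwise assume a mistake (None).
        return claims[0] if len(claims) == 1 else None

    def update(self, v, label):
        if label not in self.learners:     # class seen for the first time
            self.learners[label] = self.make_binary()
        for k, b in self.learners.items():
            b.update(v, 1 if k == label else 2)
```

A prediction of `None` corresponds to the "just assume a mistake" case in the text.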

Proof of Theorem 2. We show that computing the partition function for the Ising model on a
general graph reduces to computing the partition function for the Ising model on a path
graph with pairwise constraints, hence showing #P-completeness. The partition problem for the
Ising model on a graph is defined as follows.

Instance: An $n$-vertex graph $G$, and a natural number $\beta$, presented in unary notation.

Output: The value of the partition function $Z_G(\beta)$,

$Z_G(\beta) := \sum_{y \in \{1,2\}^n} 2^{-\beta |\Phi^G(y)|}$.

This problem was shown #P-complete in [30, Theorem 15]. The reduction to the partition problem
on a path graph with constraints is as follows.

We are given a graph $G = (V_G, E_G)$ with $n = |V_G|$, and further assume each vertex is "labeled"
uniquely from $1, \ldots, n$. We construct the following path graph with pairwise constraints (see Figure 4
for an illustration).

1. Find a spanning tree $T = (V_T, E_T)$ of $G$, and let $R = E_G \setminus E_T$.

2. Perform a depth-first visit of $T$. From the $2n - 1$ vertex visit sequence, create an isomorphic
path graph $P_0$ with $2n - 1$ vertices such that each vertex in $P_0$ is labeled with the corresponding
vertex label from the visit of $T$. Thus each edge of $T$ is mapped to two edges in $P_0$.

3. We now proceed to create a path graph $P = (V_P, E_P)$ from $P_0$, which also includes each edge
in $R$ twice. We initialize $P$ as a "duplicate" of $P_0$ including labels. For each edge $(v'_r, v''_r) \in R$
we then do the following:

(a) Choose an arbitrary vertex $v' \in V_P$ so that $v'$ and $v'_r$ have the same label;

(b) Let $v''''$ be a neighbor of $v'$ in $P$ (i.e., $(v', v'''') \in E_P$);

(c) Add vertices $v''$ and $v'''$ to $P$ with the labels of $v''_r$ and $v'_r$, respectively;

(d) Remove the edge $(v', v'''')$ from $P$ and add the edges $(v', v'')$, $(v'', v''')$ and $(v''', v'''')$ to $P$.

4. Finally create pairwise equality constraints between all vertices with the same label.

Figure 4: (a) The graph $G$ with vertices labelled 1 to 5. (b) A spanning tree $T$ of $G$. Note that
$R = \{(1,4), (2,5)\}$. (c) The labeled path graph $P_0$. (d) Addition of vertices associated with edge
$(1,4)$ (the light blue vertices). (e) Addition of vertices associated with edge $(2,5)$ (the light blue
vertices), forming the path graph $P$. Note that every edge in $G$ now has exactly two analogous
edges in $P$, accounting for all the edges in $P$.

Thus observe that for every edge in $G$ there are two analogous edges in $P$, and furthermore if edge
$(v, w) \notin E_G$ then there is not an analogous edge in $P$. Hence $Z_G(2\beta) = Z_P(C, \beta)$.
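Step 2 of the construction, the depth-first visit sequence of length $2n - 1$, can be sketched as below. The adjacency-list representation and the example tree are assumptions made for illustration (the tree is not the one of Figure 4), and the sketch does not perform the insertion of the edges in $R$ from step 3.

```python
def dfs_visit_sequence(tree, root):
    """Return the 2n - 1 vertex labels visited by a depth-first walk of a tree."""
    seq = [root]

    def visit(v, parent):
        for u in tree[v]:
            if u != parent:
                seq.append(u)      # walk down the edge (v, u)
                visit(u, v)
                seq.append(v)      # walk back up the same edge

    visit(root, None)
    return seq

# Hypothetical spanning tree on 5 vertices, given as adjacency lists.
tree = {1: [2, 3], 2: [1], 3: [1, 4, 5], 4: [3], 5: [3]}
labels = dfs_visit_sequence(tree, 1)
```

Consecutive entries of `labels` are the edges of the path graph, so each tree edge appears exactly twice, once in each direction of the walk.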

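Proof sketch of Proposition 2. We start off with the Matrix Perceptron bound. For brevity, we
write $X_t$ for the $\Psi$-based similarity instance at round $t$. Also, let $\langle A, B \rangle$ be a shorthand for the
inner product $\mathrm{tr}(A^\top B)$. We can write

\begin{align*}
\langle X_t, X_t \rangle
&= \mathrm{tr}\left( (\Psi^+)^\top (e_{i_t} - e_{j_t})(e_{i_t} - e_{j_t})^\top \Psi^+ (\Psi^+)^\top (e_{i_t} - e_{j_t})(e_{i_t} - e_{j_t})^\top \Psi^+ \right) \\
&= \mathrm{tr}\left( (e_{i_t} - e_{j_t})^\top \Psi^+ (\Psi^+)^\top (e_{i_t} - e_{j_t})(e_{i_t} - e_{j_t})^\top \Psi^+ (\Psi^+)^\top (e_{i_t} - e_{j_t}) \right) \\
&= \left( (e_{i_t} - e_{j_t})^\top L^+ (e_{i_t} - e_{j_t}) \right)^2 \\
&\le R^2 .
\end{align*}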

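Moreover, define $K$ vectors $u_1, \ldots, u_K \in \mathbb{R}^n$ as follows:

$u_k = (u_{k,1}, \ldots, u_{k,n})^\top$, with $u_{k,i} = [k = y_i]$,

$y_i$ being the label of the $i$-th vertex of $G$. Now, if we let $U := \Psi \left( \sum_{k=1}^{K} u_k u_k^\top \right) \Psi^\top$ we have

\begin{align*}
\langle U, X_t \rangle
&= \mathrm{tr}(U^\top X_t) \\
&= \sum_{k=1}^{K} \mathrm{tr}\left( \Psi u_k u_k^\top \Psi^\top (\Psi^+)^\top (e_{i_t} - e_{j_t})(e_{i_t} - e_{j_t})^\top \Psi^+ \right) \\
&= \sum_{k=1}^{K} \mathrm{tr}\left( (e_{i_t} - e_{j_t})^\top \Psi^+ \Psi u_k u_k^\top \Psi^\top (\Psi^+)^\top (e_{i_t} - e_{j_t}) \right) \\
&= \sum_{k=1}^{K} \left( (e_{i_t} - e_{j_t})^\top \Psi^+ \Psi u_k \right)^2 .
\end{align*}

By definition of pseudoinverse, $\Psi(\Psi^+ \Psi u_k) = (\Psi \Psi^+ \Psi) u_k = \Psi u_k$ for all $k = 1, \ldots, K$. Hence
(recall the properties of $\Psi$ from the main text), $\Psi^+ \Psi u_k = u_k + c\mathbf{1}$ for some $c \in \mathbb{R}$. We therefore
have that $(e_{i_t} - e_{j_t})^\top \Psi^+ \Psi u_k = u_{k,i_t} - u_{k,j_t}$, and

$\langle U, X_t \rangle = \sum_{k=1}^{K} (u_{k,i_t} - u_{k,j_t})^2$.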



Now, if $y_{i_t,j_t} = 0$ (i.e., $y_{i_t} = y_{j_t}$) then for all $k$ we have $u_{k,i_t} - u_{k,j_t} = 0$, so that $\langle U, X_t \rangle = 0$. On
the other hand, if $y_{i_t,j_t} = 1$ (i.e., $y_{i_t} \ne y_{j_t}$) then there exist distinct $a, b \in \{1, \ldots, K\}$ such that
$|u_{a,i_t} - u_{a,j_t}| = |u_{b,i_t} - u_{b,j_t}| = 1$, and for all other $k \ne a, b$ we have $u_{k,i_t} - u_{k,j_t} = 0$. So, in this case
$\langle U, X_t \rangle = 2$.
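The two cases above can be checked numerically from the indicator vectors $u_k$ alone. The sketch below is illustrative only: it evaluates $\sum_k (u_{k,i} - u_{k,j})^2$ directly, with vertices indexed from 0 and hypothetical function names, and does not involve $\Psi$.

```python
def indicator_vectors(labels, K):
    """u_k for k = 1..K, where u_{k,i} = 1 if vertex i has label k and 0 otherwise."""
    return [[1 if y == k else 0 for y in labels] for k in range(1, K + 1)]

def separability_value(labels, K, i, j):
    """Sum over k of (u_{k,i} - u_{k,j})^2: 0 for same labels, 2 for different labels."""
    return sum((u_k[i] - u_k[j]) ** 2 for u_k in indicator_vectors(labels, K))

labels = [1, 1, 2, 3]   # toy labeling of 4 vertices with K = 3 classes
```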

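This gives the linear separability condition of sequence $(X_1, y_{i_1,j_1}), (X_2, y_{i_2,j_2}), \ldots$ w.r.t. $U$.
Finally, we bound $\langle U, U \rangle$. Let $\Phi^G_{a,b} := \{(i, j) \in E : y_i = a,\ y_j = b\}$. We have:

\begin{align*}
\langle U, U \rangle
&= \mathrm{tr}(U^\top U) \\
&= \sum_{a=1}^{K} \sum_{b=1}^{K} \mathrm{tr}\left( \Psi u_a u_a^\top \Psi^\top \Psi u_b u_b^\top \Psi^\top \right) \\
&= \sum_{a=1}^{K} \sum_{b=1}^{K} \left( u_a^\top \Psi^\top \Psi u_b \right)^2 \\
&= \sum_{a=1}^{K} \sum_{b=1}^{K} \left( u_a^\top L u_b \right)^2 \\
&= \sum_{a=1}^{K} \left( |\Phi^G_a|^2 + \sum_{b \ne a} |\Phi^G_{a,b}|^2 \right) .
\end{align*}

So, noticing that $\sum_{b : b \ne a} |\Phi^G_{a,b}| = |\Phi^G_a|$ and hence that $\sum_{b : b \ne a} |\Phi^G_{a,b}|^2 \le |\Phi^G_a|^2$, we conclude that

$\langle U, U \rangle \le 2 \sum_{a=1}^{K} |\Phi^G_a|^2$.

With the above in hand, the mistake bound on $M_P$ easily follows from the standard analysis of the
Perceptron algorithm with nonzero threshold.

By a similar token, the bound on $M_W$ follows from the arguments in [37], after defining $U$ to
be a normalized version of the one we defined above for Matrix Perceptron, and noticing that
the relevant matrices in Algorithm 1 are positive semidefinite and normalized to trace 1.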

A.3 Missing proofs from Section 4

The following lemma relies on the equivalence between the effective resistance $R^G_{i,j}$ of an edge $(i,j)$ and
its probability of being included in a randomly drawn spanning tree.

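Lemma 10. Let $(G, y)$ be a labeled graph, and $T$ be a spanning tree of $G$ drawn uniformly at
random. Then, for all $k = 1, \ldots, K$, we have:

1. $E[|\Phi^T_k|] = \sum_{(i,j) \in \Phi^G_k} R^G_{i,j}$, and

2. $E[|\Phi^T_k|^2] \le 2 \left( \sum_{(i,j) \in \Phi^G_k} R^G_{i,j} \right)^2$.

Proof. Set $s = |\Phi^G_k|$ and $\Phi^G_k = \{(i_1, j_1), (i_2, j_2), \ldots, (i_s, j_s)\}$. Also, for $\ell = 1, \ldots, s$, let $X_\ell$ be the
random variable which is 1 if $(i_\ell, j_\ell)$ is an edge of $T$, and 0 otherwise. From $E[X_\ell] = R^G_{i_\ell,j_\ell}$ we
immediately have 1). In order to prove 2), we rely on the negative correlation of variables $X_\ell$, i.e.,
that $E[X_\ell X_{\ell'}] \le E[X_\ell] E[X_{\ell'}]$ for $\ell \ne \ell'$ (see, e.g., [38]). Then we can write

\begin{align*}
E\left[ |\Phi^T_k|^2 \right]
&= E\left[ \left( \sum_{\ell=1}^{s} X_\ell \right)^2 \right] \\
&= E\left[ \sum_{\ell=1}^{s} \sum_{\ell'=1}^{s} X_\ell X_{\ell'} \right] \\
&= \sum_{\ell=1}^{s} E[X_\ell] + \sum_{\ell=1}^{s} \sum_{\ell' \ne \ell} E[X_\ell X_{\ell'}] \\
&\le \sum_{\ell=1}^{s} E[X_\ell] + \sum_{\ell=1}^{s} \sum_{\ell' \ne \ell} E[X_\ell] E[X_{\ell'}] .
\end{align*}

Now, for any spanning tree $T$ of $G$, if $s \ge 1$ then it must be the case that $|\Phi^T_k| \ge 1$, and hence
$\sum_{\ell=1}^{s} E[X_\ell] = E[|\Phi^T_k|] \ge 1$. Combined with the above we obtain:

$\sum_{\ell=1}^{s} E[X_\ell] + \sum_{\ell=1}^{s} \sum_{\ell' \ne \ell} E[X_\ell] E[X_{\ell'}] \le 2 \left( \sum_{\ell=1}^{s} E[X_\ell] \right)^2 = 2 \left( \sum_{(i,j) \in \Phi^G_k} R^G_{i,j} \right)^2$,

as claimed. □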



Proof of Theorem 3. From Proposition 2 we have that if we execute Matrix Winnow on $B = B_T$
with similarity instances constructed from $\Psi_B$, then the number $M$ of mistakes satisfies

$M = O(|\Phi^B| D_B \log n)$,

where $D_B$ is the resistance diameter of $B$. Since $B$ is a tree, its resistance diameter is equal
to its diameter, which is $O(\log n)$. Moreover, $|\Phi^B_k| = O(|\Phi^T_k| \log n)$, for $k = 1, \ldots, K$, hence
$|\Phi^B| = O(|\Phi^T| \log n)$. Plugging back, taking expectation over $T$, and using Lemma 10, 1) proves
the Matrix Winnow bound. Similarly, if we run the Matrix Perceptron algorithm on $B$ with
similarity instances constructed from $\Psi_B$ then

$M = O(|\Phi^B|^2 D_B^2)$.

Proceeding as before, in combination with Lemma 10, 2), proves the Matrix Perceptron bound.



Proof of Theorem 4. First of all, the fact that the algorithm takes $O(\log^2 n)$ time per round easily follows
from the fact that, since $B$ is a balanced binary tree, the sizes of sets $\mathcal{P}_t$ (prediction step)
and $\mathcal{S}_t$ (update step in (5)) are both $O(\log n)$.

As for initialization time, a naive implementation would require $O(n^2)$ (we must build the zero
matrix $F$). We now outline a method of growing a data structure that stores a representation of $F$
online for which the initialisation time is only $O(n)$, while keeping the per round time to $O(\log^2 n)$.
For every vertex $\ell$ in $B$ the algorithm maintains a subtree $B_\ell$ of $B$, initially set to $\{\rho\}$, $\rho$ being the
root of $B$. At every vertex $\ell' \in B_\ell$ is stored the value $F_{\ell,\ell'}$. At the start of time $t$, the algorithm
climbs $B$ from $i_t$ to $\rho$, in doing so storing the ordered list $\mathcal{L}_{i_t}$ of vertices in the path from $\rho$ to $i_t$.
The same is done with $j_t$. The set $\mathcal{S}_t$ is then computed. For all $\ell \in \mathcal{S}_t$, the tree $B_\ell$ is then extended
to include the vertices in $\mathcal{S}_t$ and the path from $\rho$ (note that for each $\ell \in \mathcal{S}_t$ this takes only $O(\log n)$
time, since we have the list $\mathcal{L}_{i_t}$). Whenever a new vertex $\ell'$ is added to $B_\ell$, the value $F_{\ell,\ell'}$ is set to
zero. Hence, we initialize $F$ "on demand", the only initialization step being the allocation of the
BST, i.e., $O(n)$ time.

We now continue by showing the equivalence of the sequence of predictions issued by the prediction step to
those of the Matrix Perceptron algorithm with similarity instances constructed from $\Psi_B$.

For every $\ell \in \mathcal{S}_t$ define $A_t(\ell)$ as the maximal subtree of $B$ that contains $\ell$ and does not contain
any nodes in $\mathcal{P}_t \setminus \{i_t, j_t\}$.

Lemma 11. $A_t(\cdot)$ defined above enjoys the following properties (see Figure 5, left, for reference).

1. For all $\ell$, $A_t(\ell)$ is uniquely defined;

2. Any subtree $T$ of $B$ that has no vertices from $\mathcal{P}_t \setminus \{i_t, j_t\}$ (and hence any of the trees $A_t(\cdot)$)
contains at most one vertex from $\mathcal{S}_t$;

3. The subtrees $\{A_t(\ell) : \ell \in \mathcal{S}_t\}$ are pairwise disjoint;

4. The set $\{A_t(\ell) : \ell \in \mathcal{S}_t\} \cup (\mathcal{P}_t \setminus \{i_t, j_t\})$ covers $B$ (so in particular $\{A_t(\ell) : \ell \in \mathcal{S}_t\}$ covers the
set of leaves of $B$).

Proof. 1. Suppose we have subtrees $T$ and $T'$ with $T \ne T'$ that both satisfy the conditions of
$A_t(\ell)$. Then w.l.o.g. assume there exists a vertex $\ell'$ in $T$ that is not in $T'$. Since $T$ and $T'$
are both connected and both contain $\ell$, the subgraph $T \cup T'$ of $B$ is connected and is hence
a subtree. Since neither $T$ nor $T'$ contains vertices in $\mathcal{P}_t \setminus \{i_t, j_t\}$, $T \cup T'$ does not contain
any such either. Hence, because $T'$ is a strict subtree of $T \cup T'$, we have contradicted the
maximality of $T'$.

2. Suppose $T$ has distinct vertices $\ell, \ell' \in \mathcal{S}_t$. Since $T$ is connected, it must contain the path in
$B$ from $\ell$ to $\ell'$. This path goes from $\ell$ to the neighbor of $\ell$ that is in $\mathcal{P}_t \setminus \{i_t, j_t\}$, then follows
the path $\mathcal{P}_t \setminus \{i_t, j_t\}$ (in the right direction) until a neighbor of $\ell'$ is reached. The path then
terminates at $\ell'$. Such a path contains at least one vertex in $\mathcal{P}_t \setminus \{i_t, j_t\}$, contradicting the
initial assumption about $T$.

3. Assume the converse, i.e., that there exist distinct $\ell, \ell'$ in $\mathcal{S}_t$ such that $A_t(\ell)$ and $A_t(\ell')$ share
vertices. Then, since $A_t(\ell)$ and $A_t(\ell')$ are connected, $A_t(\ell) \cup A_t(\ell')$ must also be connected
(and hence must be a subtree of $B$). Since $A_t(\ell) \cup A_t(\ell')$ shares no vertices with $\mathcal{P}_t \setminus \{i_t, j_t\}$, and
contains both $\ell$ and $\ell'$ (which are both in $\mathcal{S}_t$), the statement in Item 2 above is contradicted.

4. Assume that we have an $\ell \in B \setminus (\mathcal{P}_t \setminus \{i_t, j_t\})$. Then let $P'$ be the path from $\ell$ to the (first
vertex encountered in) the path $\mathcal{P}_t \setminus \{i_t, j_t\}$. Let $\ell'$ be the second from last vertex in $P'$. Then
$\ell'$ is a neighbor of a vertex in $\mathcal{P}_t$, but is not in $\mathcal{P}_t \setminus \{i_t, j_t\}$, so it must be in $\mathcal{S}_t$. This implies
that the path $P''$ that goes from $\ell$ to $\ell'$ contains no vertices in $\mathcal{P}_t \setminus \{i_t, j_t\}$ and is therefore
(Item 1) a subtree of $A_t(\ell')$. Hence, $\ell \in A_t(\ell')$.

□




Figure 5: Left: The same BST as in Figure 2 with $i_t = 2$ and $j_t = 5$. Light blue vertices are those
in $\mathcal{P}_t$. Thick-bordered vertices are those in $\mathcal{S}_t$. Since vertices 3 and 4 are in $A_t(10)$, we have
$\bar{f}_t(3) = \bar{f}_t(4) = f_t(10)$. Since vertices 7 and 8 are in $A_t(12)$, we have $\bar{f}_t(7) = \bar{f}_t(8) = f_t(12)$. For all other
vertices $\ell$, we have $\bar{f}_t(\ell) = f_t(\ell)$. Right: The same BST as in Figure 2 with path $\mathcal{P}_{t'}$ (light blue
vertices) having endpoints $i_{t'} = 3$ and $j_{t'} = 8$. Thick-bordered vertices are still those in $\mathcal{S}_t$. Path
$\mathcal{P}_{t'}$ intersects $\mathcal{S}_t$ at two vertices, 10 and 12, which means that $3 \in A_t(10)$ and $8 \in A_t(12)$. We have
$\bar{f}_t(3) = f_t(10)$, $\bar{f}_t(8) = f_t(12)$, and $((e_3 - e_8)^\top L^+ (e_2 - e_5))^2 = (\bar{f}_t(3) - \bar{f}_t(8))^2 = (f_t(10) - f_t(12))^2$.

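Lemma 12. Let $L$ be the Laplacian matrix of $B$, and $\ell, \ell' \in S_t$. Then for any pair of vertices $\kappa$
and $\kappa'$ of $B$ with $\kappa \in A_t(\ell)$ and $\kappa' \in A_t(\ell')$ we have

\[
  (e_\kappa - e_{\kappa'})^\top L^+ (e_{i_t} - e_{j_t}) = f_t(\ell') - f_t(\ell),
\]

where $e_i$ is the $i$-th element in the canonical basis of $\mathbb{R}^{2n-1}$.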
Proof. We first extend the tagging function $f_t$ to all vertices of $B$ via the vector⁶ $\mathbf{f}_t$ as follows (note
that, by Lemma 11, $\mathbf{f}_t$ is well defined):

1. For all $\ell \in \mathcal{P}_t \setminus \{i_t, j_t\}$, set $\mathbf{f}_t(\ell) = f_t(\ell)$;

2. For all $\ell' \in S_t$ and $\ell \in A_t(\ell')$, set $\mathbf{f}_t(\ell) = f_t(\ell')$.

Claim 1. $L\mathbf{f}_t = e_{j_t} - e_{i_t}$.

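Proof of claim. For any vertex $\kappa$ of $B \setminus \{i_t, j_t\}$ one of the following holds:

1. If $\kappa \in \mathcal{P}_t$, then $\kappa$ has a neighbor $\kappa_1$ with $\mathbf{f}_t(\kappa_1) = \mathbf{f}_t(\kappa) - 1$, one neighbor $\kappa_2$ with
$\mathbf{f}_t(\kappa_2) = \mathbf{f}_t(\kappa) + 1$, and (unless $\kappa$ is the root of $B$) one neighbor $\kappa_3$ with $\mathbf{f}_t(\kappa_3) = \mathbf{f}_t(\kappa)$.
We therefore have that $[L\mathbf{f}_t]_\kappa = 3\mathbf{f}_t(\kappa) - \mathbf{f}_t(\kappa_1) - \mathbf{f}_t(\kappa_2) - \mathbf{f}_t(\kappa_3) = 0$ (if $\kappa$ is the root,
$\kappa_3$ is absent and $[L\mathbf{f}_t]_\kappa = 2\mathbf{f}_t(\kappa) - \mathbf{f}_t(\kappa_1) - \mathbf{f}_t(\kappa_2) = 0$).

2. If $\kappa \in \mathcal{N}_t \setminus \mathcal{P}_t$, then $\kappa$ has one neighbor $\kappa_1$ in $\mathcal{P}_t$ and we have $\mathbf{f}_t(\kappa_1) = \mathbf{f}_t(\kappa)$. Let $T_\kappa$ be the
subtree of $B$ containing exactly vertex $\kappa$ and all neighbors of $\kappa$ bar $\kappa_1$. Since $\mathcal{P}_t$ is connected,
contains $\kappa_1$, and does not contain $\kappa$, none of the other neighbors of $\kappa$ is in $\mathcal{P}_t$. Hence
$T_\kappa$ is a subtree of $B$ that contains $\kappa$ and no vertices from $\mathcal{P}_t \setminus \{i_t, j_t\}$, and so by Lemma 11
item 1 it must be a subtree of $A_t(\kappa)$. Hence, by definition of $\mathbf{f}_t$, all vertices $\kappa_2$ in $T_\kappa$ satisfy
$\mathbf{f}_t(\kappa_2) = \mathbf{f}_t(\kappa)$. This implies that for all neighbors $\kappa_3$ of $\kappa$ we have $\mathbf{f}_t(\kappa_3) = \mathbf{f}_t(\kappa)$, which in
turn gives $[L\mathbf{f}_t]_\kappa = 0$.

3. If $\kappa \notin \mathcal{N}_t$ then, by Lemma 11 item 4, let $\kappa$ be contained in $A_t(\ell)$ for some $\ell \in S_t$. Let $T_\kappa$
be the subtree of $B$ containing exactly vertex $\kappa$ and all neighbors of $\kappa$. Note that $T_\kappa$ is a
subtree of $B$ that contains $\kappa$ and no vertices from $\mathcal{P}_t \setminus \{i_t, j_t\}$. Since $A_t(\ell)$ also contains $\kappa$
(hence $A_t(\ell) \cup T_\kappa$ is connected), we have that $A_t(\ell) \cup T_\kappa$ is a subtree of $B$ that contains $\ell$ and
no vertices from $\mathcal{P}_t \setminus \{i_t, j_t\}$. By Lemma 11 item 1, this implies that $A_t(\ell) \cup T_\kappa$ is a subtree
of (and hence equal to) $A_t(\ell)$. Hence, by definition of $\mathbf{f}_t$, we have that $\mathbf{f}_t$ is identical on $T_\kappa$.
Thus all neighbors $\kappa_1$ of $\kappa$ satisfy $\mathbf{f}_t(\kappa_1) = \mathbf{f}_t(\kappa)$, implying again $[L\mathbf{f}_t]_\kappa = 0$.

So in all three cases $[L\mathbf{f}_t]_\kappa = 0$.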

Finally, let $i'_t$ be the neighbor of $i_t$ in $B$. We have $[L\mathbf{f}_t]_{i_t} = \mathbf{f}_t(i_t) - \mathbf{f}_t(i'_t) = 1 - 2 = -1$. Similarly,
we have $[L\mathbf{f}_t]_{j_t} = 1$. Putting together, we have shown that $L\mathbf{f}_t = e_{j_t} - e_{i_t}$, thereby concluding the
proof of Claim 1.

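Now, by definition of pseudoinverse,

\[
  L\mathbf{f}_t = LL^+L\mathbf{f}_t = LL^+(e_{j_t} - e_{i_t}).
\]

This implies that $L(\mathbf{f}_t - L^+(e_{j_t} - e_{i_t})) = 0$. Therefore, since the null space of the Laplacian of a
connected graph is spanned by the all-ones vector $\mathbf{1}$, there exists a constant $c$ such
that $\mathbf{f}_t = L^+(e_{j_t} - e_{i_t}) + c\mathbf{1}$. From the definition of $\mathbf{f}_t$ we can write

\begin{align*}
  f_t(\ell') - f_t(\ell) &= \mathbf{f}_t(\kappa') - \mathbf{f}_t(\kappa) \\
  &= \bigl([L^+(e_{j_t} - e_{i_t})]_{\kappa'} + c\bigr) - \bigl([L^+(e_{j_t} - e_{i_t})]_{\kappa} + c\bigr) \\
  &= (e_\kappa - e_{\kappa'})^\top L^+ (e_{i_t} - e_{j_t}),
\end{align*}

as claimed. □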
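
As a sanity check of Claim 1, the following minimal Python sketch builds a small binary tree, tags the path between two leaves with 1, 2, ... (an assumption consistent with the values 1 and 2 used in the proof above), lets every off-path vertex inherit the tag of the path vertex it hangs from, and evaluates the Laplacian applied to the extended tagging vector entry by entry. The tree, the vertex names and the helpers `extended_tags` and `laplacian_times` are illustrative only and do not come from the paper.

```python
from collections import deque


def extended_tags(adj, i, j):
    """Tag the i-j path 1, 2, ...; off-path vertices inherit the tag of their attachment point."""
    # Breadth-first search from i to recover the unique i-j path in the tree.
    prev = {i: None}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    path = [j]
    while path[-1] != i:
        path.append(prev[path[-1]])
    path.reverse()
    tags = {v: pos for pos, v in enumerate(path, start=1)}
    # Propagate tags outward from the path to the rest of the tree.
    queue = deque(path)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in tags:
                tags[w] = tags[v]
                queue.append(w)
    return tags


def laplacian_times(adj, f):
    """Entry v of L f is deg(v) * f(v) minus the sum of f over the neighbors of v."""
    return {v: len(adj[v]) * f[v] - sum(f[w] for w in adj[v]) for v in adj}


# Hypothetical tree: root 0, internal vertices 1 and 2, leaves 3, 4, 5, 6.
ADJ = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
TAGS = extended_tags(ADJ, 3, 5)      # i_t = 3, j_t = 5
print(laplacian_times(ADJ, TAGS))    # -1 at vertex 3, +1 at vertex 5, 0 elsewhere
```

In this toy instance the root has degree 2, so it also exercises the "unless $\kappa$ is the root" exception in the first case of the proof.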



6 In our notation, we interchangeably view $f$ both as a tagging function from the $2n - 1$ vertices of $B$ to the
natural numbers and as a $(2n - 1)$-dimensional vector.
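Lemma 13. Let $L$ be the Laplacian matrix of $B$, and $\kappa, \kappa'$ be two vertices of $B$. Let $\mathcal{P}$ be the path
from $\kappa$ to $\kappa'$ in $B$. Then for any $t$ either $|\mathcal{P} \cap S_t| \le 1$ or $\mathcal{P} \cap S_t = \{\ell, \ell'\}$, for two distinct vertices
$\ell$ and $\ell'$. No other cases are possible. Moreover,

\[
  \bigl((e_\kappa - e_{\kappa'})^\top L^+ (e_{i_t} - e_{j_t})\bigr)^2 =
  \begin{cases}
    0 & \text{if } |\mathcal{P} \cap S_t| \le 1, \\
    (f_t(\ell) - f_t(\ell'))^2 & \text{if } \mathcal{P} \cap S_t = \{\ell, \ell'\}.
  \end{cases}
\]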

Proof. By Lemma 11 item 4, we have two possible cases only:

1. There exists $\ell \in S_t$ such that both $\kappa$ and $\kappa'$ are in $A_t(\ell)$. In this case (since $A_t(\ell)$ is connected)
the path $\mathcal{P}$ lies in $A_t(\ell)$. Since, by Lemma 11 item 2, no $\ell' \in S_t$ with $\ell' \neq \ell$ can be in $A_t(\ell)$,
the path $\mathcal{P}$ contains at most one vertex of $S_t$, namely $\ell$ (if any).

2. There exist two distinct nodes $\ell, \ell' \in S_t$ such that $\kappa \in A_t(\ell)$ and $\kappa' \in A_t(\ell')$. In this case,
$\mathcal{P}$ corresponds to the following path: First go from $\kappa$ to $\ell$ (by Lemma 11 item 2, since this
path lies in $A_t(\ell)$ the only vertex in $S_t$ that lies in this section of the path is $\ell$); then go to
the neighbor of $\ell$ that is in $\mathcal{P}_t \setminus \{i_t, j_t\}$; then follow the path $\mathcal{P}_t \setminus \{i_t, j_t\}$ until you reach the
neighbor of $\ell'$ (this section of $\mathcal{P}$ contains no vertices in $S_t$); then go from $\ell'$ to $\kappa'$ (by Lemma
11 item 2, since this path lies in $A_t(\ell')$ the only vertex in $S_t$ that lies in this section of the
path is $\ell'$). Thus, $\mathcal{P} \cap S_t = \{\ell, \ell'\}$.

The result then follows by applying Lemma 12 to the two cases above. □

Figure 5 illustrates the above lemmas by means of an example.

To conclude the proof, let $\langle A, B \rangle$ be a shorthand for $\mathrm{tr}(A^\top B)$. We see that from Algorithm 1,
Lemma 13 and the definition of $F$ in Q we can write

t-i 

(Wt,X t }= ( 2 fv.v *t> 

t'=l,t'eM 
t-i 

= E ( 2 ^*'Jt' ~ X )(( e v ~ e j t i) TL+ (. e h ~ e it)f 
t'=i,t'eM 

= 2 F W ' 

where $M$ is the set of mistaken rounds, and the second-last equality follows from a similar argument
as the one contained in the proof of Proposition 2. Threshold $2 \log n$ in Q is an upper bound on
the radius squared $\langle X_t, X_t \rangle$ of instance matrices (denoted by $R^2$ in Algorithm 1). In fact, from the
proof of Proposition 2,

\[
  \max_t \langle X_t, X_t \rangle \le \max_{(i,j) \in V^2} \bigl((e_i - e_j)^\top L^+ (e_i - e_j)\bigr)^2 = \max_{(i,j) \in V^2} \bigl(R^{B}_{i,j}\bigr)^2,
\]

which is upper bounded by the square of the diameter $2 \log n$ of $B$.

A.4 Ancillary results, and missing proofs from Section 5

This section contains the proof of Theorem 5, along with preparatory results.

A digression on cuts and directed paths

Given⁷ a connected and unweighted graph $G = (V, E)$, with $n = |V|$ vertices and $m = |E|$ edges,
any partition of $V$ into two subsets induces a cut over $E$. A cut is a cutset if it is induced by
a two-connected component partition of $V$. Fix now any spanning tree $T$ of $G$ (this will be the
one constructed by the algorithm at the end of the game, based on the paths produced by the
adversary; see Section 5.1). In this context, the $n - 1$ edges of $T$ are often called branches and
the remaining $m - n + 1$ edges are often called chords. Any branch of $T$ cuts the tree into two
components (it is therefore a cutset), and induces a two-connected component partition over $V$.
Any such cutset is called a fundamental cutset of $G$ (w.r.t. $T$). Cuts are always subsets of $E$, hence
they can naturally be represented as (binary) indicator vectors with $m$ components. For reasons
that will be clear momentarily, it is also convenient to assign each edge an orientation (tail vertex
to head vertex) and each cut an inward/outward direction. In particular, it is customary to give
a fundamental cut the orientation of its branch. As a consequence of orientations/directions, cuts
are rather represented as $m$-dimensional vectors whose components have values in $\{-1, 0, 1\}$.
Figure 6: (a) A graph with oriented edges, tagged $e_1$ through $e_{12}$. A spanning tree $T$ is denoted
by the thick edges. The cutset $\{4, 5, 6, 7\}$ is depicted in blue. This cut has inward branches
$e_3$, $e_4$, $e_5$, and $e_6$, and no outward branches. The red cutset separating vertex 3 from the rest
has inward branch $e_2$, and outward branch $e_6$. Two other (nodal) cutsets are shown: one
separating vertex 2 from the rest, and the other separating vertex 1 from the rest. (b) The
fundamental cutset matrix associated with the chosen spanning tree. Viewed as a 12-dimensional
vector, the blue cut is $q_{\{4,5,6,7\}} = (0, 0, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0)$, and can be represented
as linear combination of rows $Q_i$ of $Q$ as $-Q_3 - Q_4 - Q_5 - Q_6$, i.e., by the vector of coefficients
$u_{\{4,5,6,7\}} = (0, 0, -1, -1, -1, -1)^\top$. Notice that chords $e_7$ and $e_8$ are both inward (coefficients $-1$ in
$q_{\{4,5,6,7\}}$). (c) The connectivity structure induced by the selected spanning tree on the blue cutset
in (a). For ease of reference, all edges in the blue cut have been turned into light gray. Vertices 4,
5, 6, and 7 are all connected in $G$ under the blue cut, but they are all disconnected in $T$. The
path $6 \to 3 \to 1 \to 2 \to 7$ (depicted in blue) connects in $T$ vertex 6 to vertex 7, and is represented
by path vector $p = (1, -1, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0)^\top$, hence $Qp = (1, -1, 0, 0, 1, -1)^\top$. This path departs
from the blue (disconnected) cluster $\{4, 5, 6, 7\}$ through edge $e_6$ (traversed "the wrong way") and
returns to this set via $e_5$. Notice that $u_{\{4,5,6,7\}}^\top Qp = 0$.

7 The reader familiar with the theory of matroids will recognize what is recalled here as a well known example of 
a regular matroid on graphs. One can learn about them in standard textbooks/handbooks, e.g., P2, Ch.6]. 
Figure 6 (a) gives an example. In this figure, all edges are directed from the low index vertex to
the high index vertex. Branch $e_1$ determines a cutset (more precisely, a fundamental cutset w.r.t.
the depicted spanning tree) separating vertices 2 and 7 from the remaining ones. Edge $e_6$ isolates
just vertex 6 from the rest (again, a fundamental cutset). The fundamental cut determined by
branch $e_1$ is represented by vector $q = (1, 0, 0, 0, 0, 0, -1, 1, 0, 1, 0, 1)$. This is because if we interpret
the orientation of branch $e_1$ as outward to the cut, then $e_7$ is inward, $e_8$ is outward, as well as $e_{10}$
and $e_{12}$. The matrix in Figure 6 (b) contains as rows all fundamental cutsets. This is usually called
the fundamental cutset matrix, often denoted by $Q$ (recall that this matrix depends on spanning
tree $T$; for readability, we drop this dependence from our notation). Matrix $Q$ has rank $n - 1$.
Moreover, any cut (viewed as an $m$-dimensional vector) in the graph can be represented as a linear
combination of fundamental cutset vectors with linear combination coefficients $-1$, $+1$, and $0$. In
essence, cuts are an $(n - 1)$-dimensional vector space with fundamental cutsets (rows $Q_i$ of $Q$) as
basis. It is important to observe that the vectors $Q_i$ involved in this representation are precisely
those corresponding to the branches of $T$ that are either moving inward (coefficient $-1$) or outward
(coefficient $+1$). Hence the fewer the branches of $T$ cutting inward or outward, the sparser
this representation. Matrix $Q$ also has further properties, like total unimodularity. This implies
that any linear combination of its rows with coefficients in $\{-1, 0, +1\}$ will result in a vector
whose coefficients are again in $\{-1, 0, +1\}$.

To summarize, given a spanning tree $T$ of $G$, a direction for $G$'s edges, and the associated
matrix $Q$, any cutset⁸ $q$ in $G$ can be represented as an $m$-dimensional vector $q = Q^\top u$, where
$u \in \{-1, 0, +1\}^{n-1}$ has as many nonzero components as there are branches of $T$ belonging to $q$. With
this representation (induced by $T$) in hand, we essentially aim at learning $u$'s components in a
sequential fashion.

In order to tie this up with our similarity problem, we view the edges belonging to a given cutset
as the cut edges separating a (connected) cluster from the rest of the graph, and then associate with
any given $K$-labeling of the vertices of $G$ a sequence of $K$ weight vectors $u_k$, $k = 1, \ldots, K$, each one
corresponding to one label. Since a given label can spread over multiple clusters (i.e., the vertices
belonging to a given class label need not be a connected component of $G$), we first need to collect
connected components belonging to the same label by summing the associated coefficient vectors.
As an example, suppose in Figure 6 (a) we have 3 vertex labels corresponding to the three colors.
The blue cluster contains vertices 4, 5, 6 and 7, the green one vertex 1, and the red one vertices 2
and 3. Now, whereas the blue and the green labels are connected, the red one is not. Hence we have
$u_{\{4,5,6,7\}} = (0, 0, -1, -1, -1, -1)^\top$, $u_{\{1\}} = (1, 1, 1, 1, 0, 0)^\top$, and $u_{\{2,3\}} = (-1, -1, 0, 0, 1, 1)^\top$ is the
sum of the two cutset coefficient vectors $u_{\{2\}} = (-1, 0, 0, 0, 1, 0)^\top$ and $u_{\{3\}} = (0, -1, 0, 0, 0, 1)^\top$.
In general, our goal will then be to learn a sparse (and rank-$K$) matrix $U = \sum_{k=1}^{K} u_k u_k^\top$, where
$u_k$ corresponds to the $k$-th (connected or disconnected) class label.

Consistent with the above, we represent the pair of vertices $(i_t, j_t)$ as an indicator vector
encoding the unique path in $T$ that connects the two vertices. This encoding takes edge orientation
into account. For instance, the pair of vertices $(6, 7)$ in Figure 6 (c) is connected in $T$ by path
$P_{(6 \to 7)} = 6 \to 3 \to 1 \to 2 \to 7$. According to the direction of traversed edges (edge $e_1$ is traversed
according to its orientation, edge $e_2$ in the opposite direction, etc.), path $P_{(6 \to 7)}$ is represented by
vector $p = (1, -1, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0)^\top = \bigl((p'_t)^\top \mid \mathbf{0}_{m-n+1}^\top\bigr)^\top$, hence $Qp = p'_t = (1, -1, 0, 0, 1, -1)^\top$.⁹

8 Though we are only interested in cutsets here, this statement holds more generally for any cut of the graph. 

9 Any other path connecting 6 to 7 in $G$ would yield the same representation. For instance, going back to Figure
6 (a), consider path $P'_{(6 \to 7)} = 6 \to 4 \to 7$, whose edges are not in $T$. This gives $p' = (0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0, 0)^\top$.
It is important to observe that computing $Qp$ does not require full knowledge of matrix $Q$, since
$Qp$ only depends on $T$ and the way its edges are traversed. With the above in hand, we are ready
to prove Theorem 5.
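
The invariance of $Qp$ with respect to the chosen path can be checked numerically on the example of Figure 6. The sketch below is a minimal illustration under stated assumptions: it uses only the six branches and the two chords $(4,6)$ and $(4,7)$ that can be read off the path vectors in the text (the remaining chords are left out), it assumes $e_3 = (1,4)$ and $e_4 = (1,5)$, and the helper names are hypothetical.

```python
# Tree branches e1..e6 of Figure 6, oriented from low to high index.
# Assumption: e3 = (1, 4) and e4 = (1, 5); the text does not fix their order.
BRANCHES = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 7), (3, 6)]
# Only the two chords traversed by the path 6 -> 4 -> 7 (e9 and e10 in the
# figure); here they occupy columns 7 and 8 because the other chords are omitted.
CHORDS = [(4, 6), (4, 7)]
EDGES = BRANCHES + CHORDS


def tail_side(branch):
    """Vertices that stay connected to the tail of `branch` once it is cut from T."""
    adj = {}
    for a, b in BRANCHES:
        if (a, b) != branch:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    side, stack = {branch[0]}, [branch[0]]
    while stack:
        v = stack.pop()
        for w in adj.get(v, []):
            if w not in side:
                side.add(w)
                stack.append(w)
    return side


def cutset_matrix():
    """One row per branch: +1 if an edge leaves the tail side, -1 if it enters it."""
    rows = []
    for branch in BRANCHES:
        side = tail_side(branch)
        rows.append([int(a in side) - int(b in side) for a, b in EDGES])
    return rows


def path_vector(path):
    """Signed edge-indicator vector of a vertex path (+1 along the orientation)."""
    p = [0] * len(EDGES)
    for a, b in zip(path, path[1:]):
        if (a, b) in EDGES:
            p[EDGES.index((a, b))] = 1
        else:
            p[EDGES.index((b, a))] = -1
    return p


def q_times(p):
    """Matrix-vector product Q p."""
    return [sum(q * x for q, x in zip(row, p)) for row in cutset_matrix()]


print(q_times(path_vector([6, 3, 1, 2, 7])))  # path made of branches of T
print(q_times(path_vector([6, 4, 7])))        # path made of chords only
```

Both calls print `[1, -1, 0, 0, 1, -1]`, the signed indicator of the tree path from 6 to 7, even though the second path uses no edge of $T$.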
Figure 7: The way the $r$-norm Perceptron for similarity prediction of Section 5 builds instance
vectors. At the beginning of the game ($t = 0$) the algorithm is only aware of the number of vertices
($n = 7$ in this case). In round $t = 1$ the pair $(2, 3)$ is generated, along with the connecting path
$2 \to 1 \to 3$. Since neither of the two revealed edges creates a cycle, the two edges $(1, 2)$ and $(1, 3)$
are added to the forest in this order. Hence, $x_1 = (-1, 1, 0, 0, 0, 0)^\top$. Round $t = 2$: pair $(4, 5)$ and
path $4 \to 1 \to 2 \to 5$ are disclosed. Edges $(1, 4)$ and $(2, 5)$ are revealed to the algorithm for the first
time. The associated vector is then $x_2 = (1, 0, -1, 1, 0, 0)^\top$. Round $t = 3$: A new edge is revealed
which is disconnected from the previous subtree. We have $x_3 = (0, 0, 0, 0, 1, 0)^\top$. Round $t = 4$: The
algorithm receives pair $(4, 5)$ and corresponding path $4 \to 6 \to 7 \to 5$. While edge $(4, 6)$ is added
to the forest, causing the two subtrees to merge, neither edge $(6, 7)$ nor edge $(7, 5)$ is added. In
particular, $(7, 5)$ is not added because of the presence of an alternative path in the current forest
(which is now a single tree) joining the two vertices. Hence the observed path $4 \to 6 \to 7 \to 5$ gets
replaced by path $4 \to 1 \to 2 \to 5$, and the corresponding instance vector is $x_4 = (1, 0, -1, 1, 0, 0)^\top$.
Round $t = 5$: Since we have obtained a spanning tree, from this point on, no other edges will be
added. In this round we have $x_5 = (0, 0, 0, 0, 0, -1)^\top$, since the alternative (single edge) path $6 \to 4$
connecting 6 to 4 is already contained in the tree.
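
The construction illustrated in Figure 7 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it keeps the forest with a union-find structure, numbers edges in order of insertion, orients them from the low to the high index vertex, and reads the instance vector off the forest path joining the queried pair. The function names are hypothetical; the round-3 pair $(6,7)$ is inferred from the caption, and the round-5 observed path is an assumption, since the caption only gives the resulting vector.

```python
def find(parent, v):
    """Union-find root lookup with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def forest_path(adj, src, dst):
    """Vertex sequence of the unique forest path from src to dst."""
    stack, seen = [(src, [src])], {src}
    while stack:
        v, path = stack.pop()
        if v == dst:
            return path
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append((w, path + [w]))
    raise ValueError("endpoints are not connected in the forest")


def instance_vectors(n, observed_paths):
    """Return one (n-1)-dimensional instance vector per observed path."""
    parent = {v: v for v in range(1, n + 1)}
    adj = {v: [] for v in range(1, n + 1)}
    index = {}  # (low, high) -> coordinate of the forest edge
    vectors = []
    for path in observed_paths:
        # Add every revealed edge that does not close a cycle in the forest.
        for a, b in zip(path, path[1:]):
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[ra] = rb
                adj[a].append(b)
                adj[b].append(a)
                index[(min(a, b), max(a, b))] = len(index)
        # Encode the forest path joining the endpoints of the observed path.
        x = [0] * (n - 1)
        walk = forest_path(adj, path[0], path[-1])
        for a, b in zip(walk, walk[1:]):
            x[index[(min(a, b), max(a, b))]] = 1 if a < b else -1
        vectors.append(x)
    return vectors


ROUNDS = [
    [2, 1, 3],           # t = 1, pair (2, 3)
    [4, 1, 2, 5],        # t = 2, pair (4, 5)
    [6, 7],              # t = 3, pair (6, 7), inferred from the caption
    [4, 6, 7, 5],        # t = 4, pair (4, 5)
    [6, 7, 5, 2, 1, 4],  # t = 5, pair (6, 4); this observed path is an assumption
]
VECTORS = instance_vectors(7, ROUNDS)
for x in VECTORS:
    print(x)
```

On these five rounds the sketch reproduces the vectors $x_1, \ldots, x_5$ listed in the caption of Figure 7.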



Proof of Theorem 5. For the constructed spanning tree¹⁰ $T$, let $u_k \in \{-1, 0, 1\}^{n-1}$ be the vector
of coefficients representing the $k$-th class label w.r.t. the fundamental cutset matrix $Q$ associated
with $T$, and set $U = \sum_{k=1}^{K} u_k u_k^\top$. Also, let $x_t$ be the instance vector computed by the algorithm at
time $t$. Observe that, by the way $x_t$ is constructed (see Figure 7 for an illustrative example), we have
$x_t = Qp_t = p'_t$ for all $t$, where $p_t$ and $p'_t$ are the path vectors alluded to above. For any given class $k$, we
have that $u_k^\top x_t = u_k^\top p'_t$. Recall that vector $u_k$ contains $+1$ in each component corresponding to an
outward branch of $T$, $-1$ in each component corresponding to an inward branch, and $0$ otherwise.

9 (continued) Yet, $Qp' = Qp = (1, -1, 0, 0, 1, -1)^\top$. This invariance holds in general: Given the pair $(i, j)$, the quantity $Qp$ is
independent of $p$, if we let $p$ vary over all paths in $G$ departing from $i$ and arriving at $j$. This common value is the
$(n - 1)$-dimensional vector containing the edges in the unique path in $T$ joining $i$ and $j$ (taking traversal directions
into account). Said differently, once we are given $T$, the quantity $Qp$ only depends on $i$ and $j$, not on the path chosen
to connect them. This invariance easily follows from the fact that cuts are orthogonal to circuits, see, e.g., [19, Ch.6].

10 If fewer than $n - 1$ edges end up being revealed, the set of edges maintained by the algorithm cannot form a
spanning tree of $G$. Hence $T$ can be taken to be any spanning tree of $G$ including all the revealed edges.
We distinguish four cases (see Figure 6 (c), for reference):

1. $i_t$ and $j_t$ are both in the $k$-th class. In this case, the path in $T$ that connects $i_t$ to $j_t$ must
exit and enter the $k$-th class the same number of times (possibly zero). Since we only traverse
branches, we have in the dot product $u_k^\top p'_t$ an equal number of $+1$ terms (corresponding to
departures from the $k$-th class) and $-1$ terms (corresponding to arrivals). Hence $u_k^\top p'_t = 0$. Notice
that this applies even when the $k$-th class is not connected.

2. $i_t$ is in the $k$-th class, but $j_t$ is not. In this case, the number of departures from the $k$-th class
should exceed the number of arrivals by exactly one. Hence we must have $u_k^\top p'_t = 1$.

3. $i_t$ is not in the $k$-th class, but $j_t$ is. By symmetry (swapping $i_t$ with $j_t$), we have $u_k^\top p'_t = -1$.

4. Neither $i_t$ nor $j_t$ is in the $k$-th class. Again, we have an equal number of arrivals/departures
to/from the $k$-th class (possibly zero), hence $u_k^\top p'_t = 0$.
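
The four cases can also be obtained in one line by a telescoping argument. The derivation below is only a sketch; the class indicator $\chi_k$ and the vertex sequence $v_0, \ldots, v_s$ are notation introduced here for illustration and are not used elsewhere in the paper.

```latex
% \chi_k(v) = 1 if vertex v belongs to the k-th class, and 0 otherwise.
% v_0 = i_t, v_1, ..., v_s = j_t are the vertices of the path in T from i_t to j_t.
% Each traversed branch (v_{r-1}, v_r) contributes \chi_k(v_{r-1}) - \chi_k(v_r) to the
% dot product: +1 for a departure from the class, -1 for an arrival, 0 otherwise.
\[
  u_k^\top p'_t
  \;=\; \sum_{r=1}^{s} \bigl(\chi_k(v_{r-1}) - \chi_k(v_r)\bigr)
  \;=\; \chi_k(i_t) - \chi_k(j_t) \;\in\; \{-1, 0, +1\}.
\]
```

The right-hand side equals $0$ in cases 1 and 4, $+1$ in case 2, and $-1$ in case 3, regardless of whether the $k$-th class is connected.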

We are now in a position to state our linear separability condition. We can write

\[
  \mathrm{vec}(U)^\top \mathrm{vec}(X_t) = \mathrm{tr}(U^\top X_t) = \sum_{k=1}^{K} \mathrm{tr}(u_k u_k^\top x_t x_t^\top) = \sum_{k=1}^{K} (u_k^\top x_t)^2 = \sum_{k=1}^{K} (u_k^\top p'_t)^2,
\]

which is 2 if $i_t$ and $j_t$ are in different classes (i.e., $i_t$ and $j_t$ are dissimilar), and 0 otherwise (i.e., $i_t$
and $j_t$ are similar). We have therefore obtained that the label $y_t$ associated with $(i_t, j_t)$ is delivered
by the following linear-threshold function:

\[
  y_t =
  \begin{cases}
    1 & \text{if } \mathrm{vec}(U)^\top \mathrm{vec}(X_t) \ge 1, \\
    0 & \text{otherwise.}
  \end{cases}
\]
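
The separability condition can be verified exhaustively on the spanning tree and the three-color labeling of Figure 6. The following Python sketch is illustrative only: it assumes $e_3 = (1,4)$ and $e_4 = (1,5)$, computes each coefficient vector from the class indicator on the tail and head of every branch, and uses hypothetical helper names.

```python
# Branches e1..e6 of the spanning tree in Figure 6, oriented low -> high index.
# Assumption: e3 = (1, 4) and e4 = (1, 5); the text does not fix their order.
BRANCHES = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 7), (3, 6)]
# Blue, green and red classes from the running example.
CLASSES = [{4, 5, 6, 7}, {1}, {2, 3}]


def coefficient_vector(cls):
    """+1 for an outward branch (tail in the class), -1 for an inward one, 0 otherwise."""
    return [int(a in cls) - int(b in cls) for a, b in BRANCHES]


def tree_path_vector(i, j):
    """Signed indicator of the path from i to j in the tree (+1 along the orientation)."""
    adj = {}
    for a, b in BRANCHES:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev, stack = {i: None}, [i]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                stack.append(w)
    x = [0] * len(BRANCHES)
    v = j
    while prev[v] is not None:
        a = prev[v]  # the branch is traversed from a to v
        if (a, v) in BRANCHES:
            x[BRANCHES.index((a, v))] = 1
        else:
            x[BRANCHES.index((v, a))] = -1
        v = a
    return x


def score(i, j):
    """Sum over classes of (u_k . x)^2, i.e. vec(U) . vec(x x^T)."""
    x = tree_path_vector(i, j)
    return sum(
        sum(u * c for u, c in zip(coefficient_vector(cls), x)) ** 2
        for cls in CLASSES
    )


def same_class(i, j):
    return any(i in cls and j in cls for cls in CLASSES)


for i in range(1, 8):
    for j in range(1, 8):
        assert score(i, j) == (0 if same_class(i, j) else 2)
print(score(6, 7), score(1, 2))  # similar pair, dissimilar pair
```

The final line prints `0 2`: vertices 6 and 7 share the blue label, while vertices 1 and 2 carry different labels, so thresholding the score at 1 recovers the similarity label.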

Because we can interchangeably view $\mathrm{vec}(\cdot)$ as vectors or matrices, this opens up the possibility
of running any linear-threshold learning algorithm (on either vectors or matrices). For $r$-norm
Perceptrons with the selected norm $r$ and decision threshold, we have a bound on the number $M$
of mistakes of the form [20, 17]

\[
  M = O\bigl(\|\mathrm{vec}(U)\|_1^2 \, \|\mathrm{vec}(X_t)\|_\infty^2 \log n\bigr),
\]

where

\[
  \|\mathrm{vec}(U)\|_1 = \Bigl\|\mathrm{vec}\Bigl(\sum_{k=1}^{K} u_k u_k^\top\Bigr)\Bigr\|_1 = \Bigl\|\sum_{k=1}^{K} \mathrm{vec}(u_k u_k^\top)\Bigr\|_1 \le \sum_{k=1}^{K} \|\mathrm{vec}(u_k u_k^\top)\|_1 = \sum_{k=1}^{K} \|u_k\|_1^2
\]

and

\[
  \|\mathrm{vec}(X_t)\|_\infty = \|\mathrm{vec}(x_t x_t^\top)\|_\infty = \|x_t\|_\infty^2 = 1.
\]

Moreover, by the way vectors $u_k$ are constructed, we have $\|u_k\|_1 = |\Phi^T_k|$. In turn, $|\Phi^T_k| \le |\Phi^G_k|$
holds independent of the connectedness of the $k$-th cluster. Putting together and upper bounding
concludes the proof.